Abstract

This paper considers the multi-armed bandit problem with multiple simultaneous arm pulls. 
We develop a new 'irrevocable' heuristic for this problem. In particular, we do not allow
recourse to arms that were pulled at some point in the past but then discarded. This irrevocable
property is highly desirable from a practical perspective. As a consequence of this property, our 
heuristic entails a minimum amount of 'exploration'. At the same time, we find that the price 
of irrevocability is limited for a broad useful class of bandits we characterize precisely. This 
class includes one of the most common applications of the bandit model, namely, bandits whose 
arms are 'coins' of unknown biases. Computational experiments with a generative family of 
large scale problems within this class indicate losses of up to 5-10% relative to an upper bound
on the performance of an optimal policy with no restrictions on exploration. We also provide 
a worst-case theoretical analysis that shows that for this class of bandit problems, the price of 
irrevocability is uniformly bounded: our heuristic earns expected rewards that are always within 
a factor of 1/8 of an optimal policy with no restrictions on exploration. In addition to being an 
indicator of robustness across all parameter regimes, this analysis sheds light on the structural 
properties that afford a low price of irrevocability.

1 Introduction



Consider the operations of a 'fast-fashion' retailer such as Zara or H&M. Such retailers have developed
and invested in merchandise procurement strategies that permit lead times for new fashions
as short as two weeks. As a consequence of this flexibility, such retailers are able to adjust the
assortment of products offered on sale at their stores to quickly adapt to popular fashion trends.
In particular, such retailers use weekly sales data to refine their estimates of an item's popularity,
and based on such revised estimates weed out unpopular items, or else re-stock demonstrably popular
ones on a week-by-week basis. In sharp contrast, traditional retailers such as J.C. Penney or
Marks and Spencer face lead times on the order of several months. As such, these retailers need to
predict popular fashions months in advance and are allowed virtually no changes to their product
assortments over the course of a sales season, which is typically several months in length.
Understandably, this approach is not nearly as successful at identifying high-selling fashions and also
results in substantial unsold inventories at the end of a sales season. In view of the great deal of
a priori uncertainty in the popularity of a new fashion and the speed at which fashion trends evolve,
the fast-fashion operations model is highly desirable and is emerging as the de facto operations model
for large fashion retailers.

Among other things, the fast-fashion model relies crucially on an effective technology to learn 
from purchase data, and adjust product assortments based on such data. Such a technology must 
strike a balance between 'exploring' potentially successful products and 'exploiting' products that 
are demonstrably popular. A convenient mathematical model within which to design algorithms 
capable of accomplishing such a task is that of the multi-armed bandit. While we defer a precise
mathematical discussion to a later section, a multi-armed bandit consists of multiple (say n) 'arms', 
each corresponding to a Markov Decision Process. As a special case, one may think of each arm 
as an independent binomial coin with an uncertain bias specified via some prior distribution. At 
each point in time, one may 'pull' up to a certain number of arms (say k < n) simultaneously, 
or equivalently, toss up to a certain number of coins. For each tossed coin, we earn a reward 
proportional to its realization and are able to refine our estimate of its bias based on this realization. 
We neither learn about, nor earn rewards from coins that are not tossed. The multi-armed bandit
problem requires finding a policy that adaptively selects k arms to pull at every point in time with 
a view to maximizing total expected reward earned over some finite time horizon or alternatively, 
discounted rewards earned over an infinite horizon or perhaps, even long term average rewards. 

With multiple simultaneous pulls allowed, the multi-armed bandit problem we have described is 
computationally hard. A popular and empirically successful heuristic for this problem was proposed 
several decades ago by Whittle. Whittle's heuristic produces an index for every arm based on the 
state of that arm and simply calls for pulling the k arms with the highest index at every point in 
time. While it has been empirically and computationally observed that Whittle's heuristic provides 
excellent performance, the heuristic typically calls for frequent changes to the set of arms pulled 
that might, in hindsight, have been unnecessary. For instance, in the retail context, such a heuristic 
may choose to discard from the assortment a product presently being offered for sale in favor of a 
new product whose popularity is not known precisely. Later, the heuristic may well choose to
re-introduce the discarded product. While such exploration may appear necessary if one is to discover
profitable bandit arms (or popular products), enabling such a heuristic in practice will typically
call for a great number of adjustments to the product assortment, a requirement that is both
expensive and undesirable. This raises the following question: Is it possible to design a heuristic
for the multi-armed bandit problem that comes close to being optimal with a minimal number of 
adjustments to the set of arms pulled over time? 

This paper introduces a new 'irrevocable' heuristic for the multi-armed bandit problem we call 
the packing heuristic. The packing heuristic establishes a static ranking of bandit arms based on a 
measure of their potential value relative to the time required to realize that value, and pulls arms 
in the order prescribed by this ranking. For an arm currently being pulled, the heuristic may either 
choose to continue pulling that arm in the next time step or else discard the arm in favor of the 
next highest ranked arm not currently being pulled. Once discarded, an arm will never be chosen 
again; hence the term irrevocable. Irrevocability is an attractive structural constraint to impose 
on arm selection policies in a number of practical applications of the bandit model such as the 
dynamic assortment problem we have discussed or sequential drug trials where recourse to drugs 
whose testing was discontinued in the past is socially unacceptable. It is clear that an irrevocable 
heuristic makes a minimal number of changes to the set of arms pulled. What is perhaps surprising
is that the restriction to an irrevocable policy is typically far less expensive than one might expect.
In particular, we demonstrate via a theoretical analysis and computational experiments that the 
use of the packing heuristic incurs a small performance loss relative to an optimal bandit policy 
with no restriction on exploration, i.e. an optimal strategy that is allowed recourse to arms that 
were pulled but discarded in the past. 

More specifically, the present work makes the following contributions: 

• We introduce a new 'irrevocable' heuristic, the packing heuristic, for the multi-armed bandit 
problem with multiple simultaneous arm-pulls. The packing heuristic is irrevocable in that if 
an arm being pulled is at some point discarded from the set of arms being pulled, it is never 
pulled again. At the same time, the performance loss incurred relative to an optimal, potentially
non-irrevocable, control policy is limited. In particular, computational experiments
with the packing heuristic for a generative family of large scale bandit problems indicate
performance losses of up to about 5-10% relative to an upper bound on the performance
of an optimal policy with no restrictions on exploration. This level of performance suggests
that the packing heuristic is likely to serve as a viable heuristic for the multi-armed bandit 
with multiple plays even when irrevocability is not a concern. 

In addition to our computational study, we are able to demonstrate a uniform bound on 
the price of irrevocability for a broad, interesting class of bandits. This class includes most 
commonly used applications of the bandit model such as bandits whose arms are 'coins' of 
unknown biases. We demonstrate that the packing heuristic earns expected rewards that 
are always within a factor of 1/8 of an optimal, potentially non-irrevocable policy. Such a 
uniform bound guarantees robust performance across all parameter regimes; in particular, 
the packing heuristic will 'track' the performance of an optimal, potentially non-irrevocable 
policy across all parameter regimes. In addition, our analysis sheds light on the structural 
properties that afford the surprising efficacy of the irrevocable policies considered here. 

• In the interest of practical applicability, we develop a fast combinatorial implementation of the
packing heuristic. Assuming that an individual arm has $O(S)$ states, and given a time horizon
of $T$ steps, an optimal solution to the multi-armed bandit problem under consideration requires
$O(S^n T^n)$ computations. The main computational step in the packing heuristic calls for the
one time solution of a linear program with $O(nST)$ variables, whose solution via a generic LP
solver requires $O(n^3 S^3 T^3)$ computations. We develop a novel combinatorial algorithm that
solves this linear program in $O(nS^2 T \log T)$ steps by solving a sequence of dynamic programs
for each bandit arm. The technique we develop here is potentially of independent interest for
the solution of 'weakly coupled' optimal control problems with coupling constraints that must
be met in expectation. Employing this solution technique, our heuristic requires a total of
$O(nS^2 \log T)$ computations per time step amortized over the time horizon. In comparison, the
simplest theoretically sound heuristics in existence for this multi-armed bandit problem (such
as Whittle's heuristic) require $O(nS^2 T)$ computations per time step. As such, we establish
that the packing heuristic is computationally attractive.



1.1 Relevant Literature 



The multi-armed bandit problem has a rich history, and a number of excellent references (such as
Gittins (1989)) provide a thorough treatment of the subject. We review here literature especially
relevant to the present work. In the case where k = 1, that is, allowing for a single arm to be pulled
in a given time step, Gittins and Jones (1974) developed an elegant index based policy that was
shown to be optimal for the problem of maximizing discounted rewards over an infinite horizon.
Their index policy is known to be suboptimal if one is allowed to pull more than a single arm
in a given time step. Whittle (1988) developed a simple index based heuristic for a more general
bandit problem (the 'restless' bandit problem) allowing for multiple arms to be pulled in a given
time step. While his original paper was concerned with maximizing long-term average rewards,
his heuristic is easily adapted to other objectives such as discounted infinite horizon rewards or
expected rewards over a finite horizon (see for instance Caro and Gallien (2007), Bertsimas and
Nino-Mora (2000)). Weiss (1992) subsequently established that under suitable conditions, Whittle's
heuristic was asymptotically optimal (in a regime where n and k go to infinity keeping n/k constant).
Whittle's heuristic may be viewed as a modification to the optimal control policy one obtains upon
relaxing the requirement that at most k arms be pulled in a given time step to requiring that at
most k arms be pulled in expectation in any given time step. The packing heuristic we introduce is
motivated by a similar relaxation. In particular, we restrict attention to policies that entail a total
of at most kT arm pulls over the entire horizon in expectation while allowing for no more than T
pulls of any given arm. Where we differ substantially from Whittle's heuristic is the manner in
which we construct a feasible policy (one where at most k arms are pulled in a given time step)
from the relaxed policy. In fact there are potentially many reasonable ways of transforming an
optimal policy for the relaxed problem to a feasible policy for the multi-armed bandit; for instance
Bertsimas and Nino-Mora (2000) use a scheme distinct from both Whittle's and ours, that employs
optimal primal and dual solutions to a linear programming formulation of Whittle's relaxation to
construct an index heuristic for arm selection. Nonetheless, none of these schemes is irrevocable,
nor do they offer non-asymptotic performance guarantees.

The packing heuristic policy builds upon recent insights on the 'adaptivity' gap for stochastic
packing problems. In particular, Dean et al. (2004) recently established that a simple static rule
(Smith's rule) for packing a knapsack with items of fixed reward (known a priori), but whose sizes
were stochastic and unknown a priori, was within a constant factor of the optimal adaptive packing
policy. Guha and Munagala (2007) used this insight to establish a similar static rule for 'budgeted
learning problems'. In such a problem one is interested in finding a coin with highest bias from
a set of coins of uncertain bias, assuming one is allowed to toss a single coin in a given time step
and that one has a finite budget on the number of such experimental tosses allowed. Our work
parallels that work in that we draw on the insights of the stochastic packing results of Dean et al.
(2004). In addition, we must address two significant hurdles: first, correlations between the total reward
earned from pulls of a given arm and the total number of pulls of that arm (these turn out not
to matter in the budgeted learning setting, but are crucial to our setting), and secondly, the fact
that multiple arms may be pulled simultaneously (only a single arm may be pulled at any time in
the budgeted learning setting). Finally, a working paper (Bhattacharjee et al. (2007)), brought to
our attention by the authors of that work, considers a variant of the budgeted learning problem of
Guha and Munagala (2007) wherein one is allowed to toss multiple coins simultaneously. While it
is conceivable that their heuristic may be modified to apply to the multi-armed bandit problem we
address, the heuristic they develop is also not irrevocable.

Restricted to coins, our work takes an inherently Bayesian view of the multi-armed bandit
problem. It is worth mentioning that there are a number of non-parametric formulations of such
problems with a vast associated literature. Most relevant to the present model are the papers by
Anantharam et al. (1987a,b) that develop simple 'regret-optimal' strategies for multi-armed bandit
problems with multiple simultaneous plays.

Our development of an irrevocable policy for the multi-armed bandit problem was originally 
motivated by applications of this framework to 'dynamic assortment' problems of the type
mentioned in the introduction. In particular, Caro and Gallien (2007) computationally explore the use
of a number of simple index-type heuristics (similar to Whittle's heuristic) for such problems, none 
of which are irrevocable; nonetheless, they stress the importance of a minimal number of changes 
to the assortment if any such heuristic is to be practical. 

The remainder of this paper is organized as follows. Section 2 presents the multi-armed bandit
model we consider and develops an (intractable) LP whose solution yields an optimal control policy 
for this bandit problem. Section 3 develops the packing heuristic by considering a suitable relaxation 
of the multi-armed bandit problem. Section 4 introduces a structural property for bandit arms we 
call the 'decreasing returns' property. It is shown that a useful class of bandits, namely the 'coin' 
bandits relevant to the applications that motivate us, possess this property. That section then 
establishes that the price of irrevocability for bandits possessing the decreasing returns property 
is uniformly bounded. Section 5 presents very encouraging computational experiments for large 
scale bandit problems drawn from a generative family of coin type bandits. In the interest of
implementability, Section 6 develops a combinatorial algorithm for the fast computation of packing
heuristic policies for multi-armed bandits. Section 7 concludes with a perspective on interesting
directions for future work. 



2 Model 

We consider a multi-armed bandit problem with multiple simultaneous 'pulls' permitted at every
time step. A single bandit arm (indexed by $i$) is a Markov Decision Process (MDP) specified by
a state space $S_i$, an action space $A_i$, a reward function $r_i : S_i \times A_i \to \mathbb{R}_+$, and a transition
kernel $P_i : S_i \times A_i \to \Delta^{|S_i|}$ (where $\Delta^{|S_i|}$ is the $|S_i|$-dimensional unit simplex), yielding a probability
distribution over next states should one choose some action $a_i \in A_i$ in state $s_i \in S_i$.

Every bandit arm is endowed with a distinguished 'idle' action $\phi_i$. Should a bandit be idled
in some time period, it yields no rewards in that period and transitions to the same state with
probability 1 in the next period. More precisely,

$$r_i(s_i, \phi_i) = 0, \quad \forall s_i \in S_i,$$
$$P_i(s_i, \phi_i, s_i) = 1, \quad \forall s_i \in S_i.$$

We consider a bandit problem with $n$ arms. In each time step one must select a subset of up
to $k$ ($< n$) arms for which one may pick any action available at those respective arms. Should an
action other than the idle action be selected at any of these $k$ arms, we refer to such a selection
as a 'pull' of that arm. That is, any action $a_i \in A_i \setminus \{\phi_i\}$ would be considered a pull of the $i$th
arm. One is forced to pick the idle action for the remaining $n - k$ arms. We wish to find an action
selection (or control) policy that maximizes expected rewards earned over $T$ time periods. Our
problem may be cast as an optimal control problem. In particular, we define as our state space
the set $S = \prod_i S_i$ and as our action space the set $A = \prod_i A_i$. We let $\mathcal{T} = \{0, 1, \dots, T - 1\}$.
We understand by $s_i$ the $i$th component of $s \in S$ and similarly let $a_i$ denote the $i$th component
of $a \in A$. A feasible action is one which calls for simultaneously pulling at most $k$ arms. In
particular we let $A^{feas} = \{a \in A : \sum_i 1_{\{a_i \neq \phi_i\}} \le k\}$ denote the set of all feasible actions. We define
a reward function $r : S \times A \to \mathbb{R}_+$, given by $r(s, a) = \sum_i r_i(s_i, a_i)$, and a system transition kernel
$P : S \times A \to \Delta^{|S|}$, given by $P(s, a, s') = \prod_i P_i(s_i, a_i, s_i')$.
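To fix ideas, the following minimal sketch encodes a single 'coin' arm as an MDP of the form described above. It assumes, purely for illustration, a Beta prior on the coin's bias, so that the arm's state is the pair of pseudo-counts (heads, tails). The action labels `IDLE` and `PULL`, the function names, and the use of the posterior mean as the one-step reward are assumptions of this sketch rather than definitions taken from the model.

```python
from fractions import Fraction

# Hypothetical action labels: IDLE plays the role of the idle action phi_i.
IDLE, PULL = "idle", "pull"


def reward(state, action):
    """Expected one-step reward r_i(s_i, a_i) of a coin arm."""
    if action == IDLE:
        return Fraction(0)  # an idled arm earns nothing
    heads, tails = state
    return Fraction(heads, heads + tails)  # posterior mean of the bias


def transitions(state, action):
    """Return {next_state: probability}, i.e. the row P_i(s_i, a_i, .)."""
    if action == IDLE:
        return {state: Fraction(1)}  # an idled arm stays in the same state
    heads, tails = state
    p = Fraction(heads, heads + tails)
    return {(heads + 1, tails): p, (heads, tails + 1): 1 - p}


prior = (1, 1)  # Beta(1, 1): a uniform prior on the bias (an assumption)
next_states = transitions(prior, PULL)
```

Exact rational arithmetic is used only so that the probabilities in each row sum to one without rounding error; floating point would serve equally well for simulation.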

We now formally develop what we mean by a control policy. The arm selection policy we will
eventually develop will use auxiliary information aside from the current state of the system, and
so we require a general definition. Let $X_0$ be a random variable that encapsulates any endogenous
randomization in selecting an action, and define the filtration generated by $X_0$ and the history of
visited states and actions by

$$\mathcal{F}_t = \sigma\big(X_0, (s^0), (s^1, a^0), \dots, (s^t, a^{t-1})\big),$$

where $s^t$ and $a^t$ denote the state and action at time $t$, respectively. We assume that
$P(s^{t+1} = s' \mid s^t = s, a^t = a, H_t = h_t) = P(s, a, s')$ for all $s, s' \in S$, $a \in A$, $t \in \mathcal{T}$ and any $\mathcal{F}_t$-measurable random
variable $H_t$. A feasible policy simply specifies a sequence of $A^{feas}$-valued actions $\{a^t\}$ adapted to
$\mathcal{F}_t$. In particular, such a policy may be specified by a collection of $\sigma(X_0)$-measurable, $A^{feas}$-valued
random variables $\{\mu(s^0, \dots, s^t, a^0, \dots, a^{t-1}, t)\}$, one for each possible state-action history of the
system. We let $M$ denote the set of all such policies $\mu$, and denote by $J^{\mu}(s, 0)$ the expected value
of using policy $\mu$ starting in state $s$ at time 0; in particular

$$J^{\mu}(s, 0) = E\left[\sum_{t=0}^{T-1} r(s^t, a^t) \,\middle|\, s^0 = s\right],$$

where $a^t = \mu(s^0, \dots, s^t, a^0, \dots, a^{t-1}, t)$.

Our goal is to compute an optimal admissible policy. Markovian policies, i.e. policies under
which $a^t$ is measurable with respect to $\sigma(X_0, s^t)$, are particularly useful. A Markovian policy is
specified as a collection of independent $A^{feas}$-valued random variables $\{\mu(s, t)\}$, each measurable
with respect to $\sigma(X_0)$. In particular, assuming the system is in state $s$ at time $t$, such a policy
selects an action $a^t$ as the random variable $\mu(s, t)$, independent of past states and actions. We let
$M_m$ denote the set of all such admissible Markovian policies.

Every $\mu \in M_m$ is associated with a value function $J^{\mu} : S \times \mathcal{T} \to \mathbb{R}_+$ which, for every $(s, t) \in
S \times \mathcal{T}$, gives the expected value of using control policy $\mu$ starting at that state:

$$J^{\mu}(s, t) = E\left[\sum_{t'=t}^{T-1} r(s^{t'}, a^{t'}) \,\middle|\, s^t = s\right].$$

We denote by $J^*$ the optimal value function. In particular, $J^*(s, t) = \sup_{\mu \in M_m} J^{\mu}(s, t)$. The
preceding supremum is always achieved and we denote by $\mu^*$ a corresponding optimal Markovian
control policy. That is, $\mu^* \in \arg\sup_{\mu \in M_m} J^{\mu}(s, t)$ for all $(s, t) \in S \times \mathcal{T}$. Our restriction to Markovian
policies is without loss; $M_m$ always contains an optimal policy among the broader class of admissible
policies so that $\sup_{\mu \in M_m} J^{\mu}(s, 0) = \sup_{\mu \in M} J^{\mu}(s, 0)$ for all states $s$. We next formulate a
mathematical program to compute such an optimal policy.



2.1 Computing an Optimal Policy 

An optimal policy $\mu^*$ may be found via the solution of the following linear program, $LP(\pi_0)$,
specified by a parameter $\pi_0 \in \Delta^{|S|}$ that specifies the distribution of arm states at time $t = 0$.

$$\begin{array}{lll}
\max & \sum_t \sum_{s,a} \pi(s, a, t)\, r(s, a), & \\
\text{s.t.} & \sum_a \pi(s, a, t) = \sum_{s',a'} P(s', a', s)\, \pi(s', a', t - 1), & \forall t > 0,\ s \in S, \\
& \pi(s, a, t) = 0, & \forall s,\ t,\ a \notin A^{feas}, \\
& \sum_a \pi(s, a, 0) = \pi_0(s), & \forall s \in S, \\
& \pi \ge 0,
\end{array}$$

where the variables are the state action frequencies $\pi(s, a, t)$, which give the probability of being
in state $s$ at time $t$ and choosing action $a$. The first set of constraints in the above program simply
enforces the dynamics of the system, while the second set of constraints enforces the requirement
that at most $k$ arms are simultaneously pulled at any point in time.

An optimal solution to the program above may be used to construct a policy $\mu^*$ that attains
expected value $J^*(s, 0)$ starting at any state $s$ for which $\pi_0(s) > 0$. In particular, given an optimal
solution $\pi^{opt}$ to $LP(\pi_0)$, one obtains such a policy by defining $\mu^*(s, t)$ as a random variable
that takes value $a \in A$ with probability $\pi^{opt}(s, a, t) / \sum_{a'} \pi^{opt}(s, a', t)$. By construction, we have
$E[J^*(s, 0) \mid s \sim \pi_0] = OPT(LP(\pi_0))$. Of course, the number of variables in the above program grows
with $|S| = \prod_i |S_i|$, so solving it is not a tractable task, which forces us to seek approximations to
an optimal policy. The next section will present one such policy with an appealing structural
property we term 'irrevocability'.
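To make the size of this computation concrete, the sketch below computes $J^*(s, 0)$ for a very small instance by exhaustive backward induction over joint states, rather than by solving $LP(\pi_0)$ itself. It assumes Beta-Bernoulli coin arms whose one-step reward is the posterior mean; the function name `optimal_value` and the state encoding are hypothetical. The recursion enumerates every feasible subset of arms at every reachable joint state, which is precisely what becomes impractical as $n$ and $T$ grow.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations


def optimal_value(states, horizon, k):
    """J*(s, 0) for coin arms by brute-force backward induction.

    states  : tuple of (heads, tails) pseudo-counts, one pair per arm
              (a Beta prior per arm is an assumption of this sketch).
    horizon : number of time steps T.
    k       : maximum number of arms pulled per step.
    """

    @lru_cache(maxsize=None)
    def J(s, t):
        if t == horizon:
            return Fraction(0)
        # Maximize over every feasible action: any subset of at most k arms.
        return max(
            q_value(s, t, pulled)
            for size in range(k + 1)
            for pulled in combinations(range(len(s)), size)
        )

    def q_value(s, t, pulled):
        # Expand the joint outcome distribution one pulled arm at a time.
        outcomes = [(s, Fraction(1))]
        expected_reward = Fraction(0)
        for i in pulled:
            heads, tails = s[i]
            p = Fraction(heads, heads + tails)
            expected_reward += p
            expanded = []
            for state, prob in outcomes:
                up = state[:i] + ((heads + 1, tails),) + state[i + 1:]
                down = state[:i] + ((heads, tails + 1),) + state[i + 1:]
                expanded.append((up, prob * p))
                expanded.append((down, prob * (1 - p)))
            outcomes = expanded
        return expected_reward + sum(prob * J(state, t + 1) for state, prob in outcomes)

    return J(tuple(states), 0)
```

For two arms with uniform priors, a horizon of two steps and $k = 1$, the recursion returns 13/12: the first pull earns 1/2 in expectation, after which one either keeps the arm (posterior mean 2/3 after a success) or switches to the untouched arm (mean 1/2 after a failure).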



3 An Irrevocable Approximation to the Optimal Policy 

This section develops an approximation to the optimal multi-armed bandit control policy that we
will subsequently establish performs adequately relative to the optimal policy. This approximation
will possess a desirable property we term 'irrevocability'. In particular, the policy we develop will,
at any time, be permitted to pull an arm only if that arm was pulled in the prior time step, or else
never pulled in the past.

We first develop a control policy for a related bandit problem, where the requirement that
at most $k$ arms be pulled in any time step is relaxed. As we will see, this is essentially Whittle's
relaxation, and the value of the policy developed for this relaxation is an upper bound on the value
of the optimal policy. We will then use the control policy developed for this relaxed control problem
to design a policy for the multi-armed bandit problem that is irrevocable and also offers good
performance relative to the optimal policy for a broad class of bandits.

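Consider the following relaxation of the program $LP(\pi_0)$, which we denote $RLP(\pi_0)$. $RLP(\pi_0)$ may be viewed
as a primal formulation of Whittle's relaxation:

$$\begin{array}{lll}
\max & \sum_i \sum_t \sum_{s_i,a_i} \pi_i(s_i, a_i, t)\, r_i(s_i, a_i), & \\
\text{s.t.} & \sum_{a_i} \pi_i(s_i, a_i, t) = \sum_{s_i',a_i'} P_i(s_i', a_i', s_i)\, \pi_i(s_i', a_i', t - 1), & \forall t > 0,\ s_i \in S_i,\ i, \\
& \sum_i \Big[ T - \sum_{s_i} \sum_t \pi_i(s_i, \phi_i, t) \Big] \le kT, & \\
& \sum_{a_i} \pi_i(s_i, a_i, 0) = \sum_{\bar{s} : \bar{s}_i = s_i} \pi_0(\bar{s}), & \forall s_i \in S_i,\ i, \\
& \pi \ge 0,
\end{array}$$

where $\pi_i(s_i, a_i, t)$ is the probability of the $i$th bandit being in state $s_i$ at time $t$ and choosing action
$a_i$.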

The program above relaxes the requirement that at most $k$ arms be pulled in a given time
step; instead we now require that over the entire horizon at most $kT$ arms are pulled in expectation,
where the expectation is over policy randomization and state evolution. The first set of equality
constraints enforces individual arm dynamics whereas the first inequality constraint enforces the
requirement that at most $kT$ arms be pulled in expectation over the entire time horizon. The
following lemma makes the notion of a relaxation to $LP(\pi_0)$ precise; the proof may be found in the
appendix.

Lemma 1. $OPT(RLP(\pi_0)) \ge OPT(LP(\pi_0))$.
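The appendix proof is not reproduced here. As a rough guide to why such a bound should hold, the following is a sketch of the usual marginalization argument, written in the notation above; it is an illustration under the stated definitions and not necessarily the argument the appendix follows. The arm-level dynamics constraints, not written out below, follow in the same way from the product form of $P$.

```latex
% Sketch only: marginalize any feasible \pi for LP(\pi_0) onto arm i.
\pi_i(s_i, a_i, t) \;=\; \sum_{(\bar{s}, \bar{a}) \,:\, \bar{s}_i = s_i,\ \bar{a}_i = a_i} \pi(\bar{s}, \bar{a}, t).

% The objective is preserved because r(s, a) = \sum_i r_i(s_i, a_i):
\sum_t \sum_{s, a} \pi(s, a, t)\, r(s, a)
  \;=\; \sum_i \sum_t \sum_{s_i, a_i} \pi_i(s_i, a_i, t)\, r_i(s_i, a_i).

% \pi puts mass only on a \in A^{feas}, so for every t
\sum_i \sum_{s_i} \sum_{a_i \neq \phi_i} \pi_i(s_i, a_i, t)
  \;=\; \sum_{s, a} \pi(s, a, t) \sum_i 1_{\{a_i \neq \phi_i\}} \;\le\; k,

% and summing over t = 0, \dots, T - 1 yields the budget constraint of RLP(\pi_0):
\sum_i \Big[ T - \sum_{s_i} \sum_t \pi_i(s_i, \phi_i, t) \Big] \;\le\; kT.
```

Every feasible point of $LP(\pi_0)$ thus maps to a feasible point of $RLP(\pi_0)$ with the same objective value, which is the content of the inequality.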

Given an optimal solution $\bar{\pi}$ to $RLP(\pi_0)$, one may consider the policy $\mu^R$ that, assuming we
are in state $s$ at time $t$, selects a random action $\mu^R(s, t)$, where $\mu^R(s, t) = a$ with probability
$\prod_i \big(\bar{\pi}_i(s_i, a_i, t) / \sum_{a_i'} \bar{\pi}_i(s_i, a_i', t)\big)$, independent of the past. Noting that the action for each arm $i$ is
chosen independently of all other arms, we use $\mu_i^R(s_i, t)$ to denote the induced policy for arm $i$.
By construction, $E[J^{\mu^R}(s, 0) \mid s \sim \pi_0] = OPT(RLP(\pi_0))$. Moreover, we have that $\mu^R$ satisfies the
constraint

$$E\left[\sum_{t=0}^{T-1} \sum_i 1_{\{\mu_i^R(s_i^t, t) \neq \phi_i\}}\right] \le kT,$$

where the expectation is over random state transitions and endogenous policy randomization.
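The randomized per-arm policy just described amounts to normalizing the state-action frequencies of an arm at its current state and time, and sampling from the result. A minimal sketch follows; the dictionary layout `freq_i[(state, action, t)]`, the function name `sample_action`, and the convention of returning `None` for states with zero mass are all assumptions made for illustration.

```python
import random


def sample_action(freq_i, state, t, rng=random):
    """Draw an action for one arm from its state-action frequencies.

    freq_i : dict mapping (state, action, t) -> probability mass, standing in
             for an RLP solution restricted to arm i (assumed layout).
    Returns None when the state carries no mass at time t, in which case the
    normalization is undefined.
    """
    weights = {
        a: p
        for (s, a, time), p in freq_i.items()
        if s == state and time == t and p > 0
    }
    if not weights:
        return None
    actions = sorted(weights)
    # random.choices normalizes the weights, which performs the division
    # by the sum over actions at this (state, t).
    return rng.choices(actions, weights=[weights[a] for a in actions], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible.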

Of course, $\mu^R$ is not necessarily feasible; we ultimately require a policy that entails at most $k$
arm pulls in any time step. We will use $\mu^R$ to construct such a feasible policy. In addition, we
will see that if an arm is pulled and then idled in some subsequent time step, it will never again
be pulled, so that the policy we construct will be irrevocable. In what follows we will assume for
convenience that $\pi_0$ is degenerate and puts mass 1 on a single starting state. That is, $\pi_0(s_i) = 1$
for some $s_i \in S_i$ for all $i$. We first introduce some relevant notation. Given an optimal solution $\bar{\pi}$
to $RLP(\pi_0)$, define the value generated by arm $i$ as the random variable

$$R_i = \sum_{t=0}^{T-1} r_i\big(s_i^t, \mu_i^R(s_i^t, t)\big),$$

and the 'active time' of arm $i$, $T_i$, as the total number of pulls of arm $i$ entailed under that policy,

$$T_i = \sum_{t=0}^{T-1} 1_{\{\mu_i^R(s_i^t, t) \neq \phi_i\}}.$$

The expected value of arm $i$ is $E[R_i] = \sum_{s_i, a_i, t} \bar{\pi}_i(s_i, a_i, t)\, r_i(s_i, a_i)$, and the expected active time is
$E[T_i] = \sum_{s_i, a_i, t \,:\, a_i \neq \phi_i} \bar{\pi}_i(s_i, a_i, t)$. We will assume in what follows that $E[T_i] > 0$ for all $i$; otherwise,
we simply consider eliminating those $i$ for which $E[T_i] = 0$. We will also assume for analytical
convenience that $\sum_i E[T_i] = kT$. Neither assumption results in a loss of generality.
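As a small illustration of the two expectations just defined, the sketch below computes $E[R_i]$ and $E[T_i]$ from a table of state-action frequencies. The dictionary layout, the function names, and the toy numbers are assumptions made for this example and are not drawn from the paper's experiments.

```python
def expected_value_and_active_time(freq_i, reward_i, idle):
    """Return (E[R_i], E[T_i]) for one arm.

    freq_i   : dict mapping (state, action, t) -> probability, standing in
               for an RLP solution restricted to arm i (assumed layout).
    reward_i : function (state, action) -> one-step reward r_i.
    idle     : label of the idle action phi_i.
    """
    exp_reward = sum(p * reward_i(s, a) for (s, a, _t), p in freq_i.items())
    exp_pulls = sum(p for (_s, a, _t), p in freq_i.items() if a != idle)
    return exp_reward, exp_pulls


def toy_reward(state, action):
    """Hypothetical rewards: pulling earns 0.5 in state s0 and 0.75 in s1."""
    if action == "idle":
        return 0.0
    return 0.5 if state == "s0" else 0.75


# Toy arm: pulled for sure at t = 0, then pulled again with probability 0.5.
toy_freq = {
    ("s0", "pull", 0): 1.0,
    ("s1", "pull", 1): 0.5,
    ("s1", "idle", 1): 0.5,
}
value, active_time = expected_value_and_active_time(toy_freq, toy_reward, "idle")
ratio = value / active_time  # the quantity by which arms are ranked
```

Here the expected value is $0.5 + 0.5 \times 0.75 = 0.875$ and the expected active time is $1.5$ pulls.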

To motivate our policy we begin with the following analogy with a packing problem: Imagine
packing $n$ objects into a knapsack of size $B$. Each object $i$ has size $T_i$ and value $R_i$. Moreover,
we assume that we are allowed to pack fractional quantities of an object into the knapsack and
that packing a fraction $\alpha$ of the $i$th object requires space $\alpha T_i$ and generates value $\alpha R_i$. An optimal
policy is then given by the following greedy procedure: select objects in decreasing order of the ratio
$R_i/T_i$ and place them into the knapsack to the extent that there is room available. If one had more
than a single knapsack and the additional constraint that an item could not be placed in more than
a single knapsack, then the situation is more complicated. One may consider a greedy procedure
that, as before, considers items in decreasing order of the ratio $R_i/T_i$ and places them (possibly
fractionally) in sequence, into the least loaded of the bins at that point. This generalization of the
greedy procedure for the simple knapsack is suboptimal, but still a reasonable heuristic.
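The single-knapsack greedy procedure in this analogy can be sketched as follows. The function name and the example numbers are hypothetical; sizes are assumed strictly positive so that the ratio $R_i/T_i$ is well defined.

```python
def greedy_fractional_knapsack(items, capacity):
    """Greedy packing by decreasing value-to-size ratio.

    items    : list of (value R_i, size T_i) pairs with T_i > 0.
    capacity : knapsack size B.
    Returns (total value packed, fraction of each item packed).
    """
    order = sorted(
        range(len(items)),
        key=lambda i: items[i][0] / items[i][1],
        reverse=True,
    )
    fractions = [0.0] * len(items)
    total, room = 0.0, float(capacity)
    for i in order:
        if room <= 0:
            break
        value, size = items[i]
        frac = min(1.0, room / size)  # pack as much of item i as still fits
        fractions[i] = frac
        total += frac * value
        room -= frac * size
    return total, fractions


# Example: ratios are 6, 5 and 4, so the third item is only partly packed.
total, packed = greedy_fractional_knapsack([(60, 10), (100, 20), (120, 30)], 50)
```

In the packing heuristic, the arms play the role of items, with $E[R_i]$ as value and $E[T_i]$ as size, but sizes are random and are only revealed as an arm is pulled.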

Thus motivated, we begin with a loose high level description of our control policy, which we call
the 'packing' heuristic. We think of each bandit arm $i$ as an 'item' of value $E[R_i]$ with size $E[T_i]$.
For the purposes of this explanation alone, we will assume for convenience that should policy $\mu^R$
call for an arm that was pulled in the past to be idled, it will never again call for that arm to be
pulled; we will momentarily remove that assumption. Our control policy will operate as follows:
we will order arms in decreasing order of the ratio $E[R_i]/E[T_i]$. We begin with the top $k$ arms
according to this ordering. For each such arm we will select an action according to the policy
specified for that arm by $\mu^R$; should this policy call for the arm to be idled, we discard that arm
and will never again consider pulling it. We replace the discarded arm with the next available arm
(in order of initial arm rankings) and select an action for the arm according to $\mu^R$. We repeat this
procedure until we have selected non-idle actions for up to $k$ arms (or no arms are available). We
then let time advance, earn rewards, and repeat the procedure described above until the end of the
time horizon.

Algorithm 1 describes the packing heuristic policy precisely, addressing the fact that $\mu_i^R$ may
call for an arm to be idled but then pulled in some subsequent time step.

In the event that we placed no restriction on the time horizon (i.e. we set T = oo in the algorithm 
above), we have by construction, that the expected total reward earned under the above policy is 
precisely $OPT(RLP(\pi_0))$. In essence, $RLP(\pi_0)$ prescribes a policy wherein each arm $i$ generates a 
total reward with mean $E[R_i]$ using an expected total number of pulls $E[T_i]$, independent of other 
arms. The above scheme may be visualized as one which 'packs' as many of the pulls of various 
arms as possible in a manner that meets feasibility constraints. 

Algorithm 1 The Packing Heuristic 

 1: Renumber bandits so that $E[R_1]/E[T_1] \geq E[R_2]/E[T_2] \geq \cdots \geq E[R_n]/E[T_n]$. Index bandits by variable $i$. 
 2: $l_i \leftarrow 0$, $a_i \leftarrow \phi_i$ for all $i$, $s \sim \pi_0(\cdot)$ 
    {The 'local time' of every arm is set to 0 and its designated action to the idle action. An initial 
    state is drawn according to the initial state distribution $\pi_0$.} 
 3: $J \leftarrow 0$ {Total reward earned is initialized to 0.} 
 4: $X \leftarrow \{1, 2, \ldots, k\}$, $A \leftarrow \{k+1, \ldots, n\}$, $D \leftarrow \emptyset$. 
    {Initialize the set of active (X), available (A), and discarded (D) arms.} 
 5: for $t = 0$ to $T - 1$ do 
 6:   while there exists an arm $i \in X$ with $a_i = \phi_i$ do {Select up to k arms to pull.} 
 7:     Select an $i \in X$ with $a_i = \phi_i$ 
        {In what follows, either select an action for arm i or else discard it.} 
 8:     while $a_i = \phi_i$ and $l_i < T$ do {Attempt to select a pull action for arm i.} 
 9:       Select $a_i \propto \bar{\pi}_i(s_i, \cdot, l_i)$ {Select an action according to the solution to $RLP(\pi_0)$.} 
10:       $l_i \leftarrow l_i + 1$ {Increment arm i's local time.} 
11:     end while 
12:     if $l_i = T$ and $a_i = \phi_i$ then {Discard arm i and activate next highest ranked arm available.} 
13:       $X \leftarrow X \setminus \{i\}$, $D \leftarrow D \cup \{i\}$ {Discard arm i.} 
14:       if $A \neq \emptyset$ then {There are available arms.} 
15:         $j \leftarrow \min A$ {Select highest ranked available arm.} 
16:         $X \leftarrow X \cup \{j\}$, $A \leftarrow A \setminus \{j\}$ {Add arm to active set.} 
17:       end if 
18:     end if 
19:   end while 
20:   for every $i \in X$ do {Pull selected arms.} 
21:     $s_i \sim P(s_i, a_i, \cdot)$ 
        {Pull arm i; select next arm i state according to its transition kernel assuming the use of 
        action $a_i$.} 
22:     $J \leftarrow J + r_i(s_i, a_i)$ {Earn rewards.} 
23:     $a_i \leftarrow \phi_i$ 
24:   end for 
25: end for 
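
To make the mechanics of the packing heuristic concrete, the following is a minimal Python sketch of Algorithm 1, not a definitive implementation. It assumes that the arms have already been renumbered in decreasing order of E[R_i]/E[T_i] and that a per-arm randomized policy derived from a solution to the relaxed linear program is supplied as a callable. The names `packing_heuristic`, `make_arm`, `IDLE` and the dictionary keys `state`, `policy`, `step` and `reward` are illustrative choices and do not appear in the paper. The sketch credits the reward for the state in which the pull is made, which is an assumption about the intended order of the reward and transition steps.

```python
import random

IDLE = "idle"  # hypothetical label standing in for the idle action phi_i


def packing_heuristic(arms, k, T, rng):
    """Simulate one run of the packing heuristic.

    arms: list of dicts, sorted by E[R_i]/E[T_i] in decreasing order, with keys
      'state'  : current state of the arm,
      'policy' : f(state, local_time, rng) -> action (IDLE or a pull action),
      'step'   : f(state, action, rng) -> next state,
      'reward' : f(state, action) -> float.
    Returns (total reward, list of discarded arm indices).
    """
    n = len(arms)
    local_time = [0] * n
    action = [IDLE] * n
    active = list(range(min(k, n)))
    available = list(range(min(k, n), n))
    discarded = []
    total = 0.0
    for _ in range(T):
        # Select up to k arms to pull.
        while any(action[i] == IDLE for i in active):
            i = next(j for j in active if action[j] == IDLE)
            # Advance arm i's local time until a pull action is drawn.
            while action[i] == IDLE and local_time[i] < T:
                action[i] = arms[i]["policy"](arms[i]["state"], local_time[i], rng)
                local_time[i] += 1
            if local_time[i] == T and action[i] == IDLE:
                # Irrevocably discard arm i and activate the next ranked arm.
                active.remove(i)
                discarded.append(i)
                if available:
                    active.append(available.pop(0))
        # Pull the selected arms (assumption: reward is earned in the pre-pull state).
        for i in active:
            s = arms[i]["state"]
            total += arms[i]["reward"](s, action[i])
            arms[i]["state"] = arms[i]["step"](s, action[i], rng)
            action[i] = IDLE
    return total, discarded


def make_arm(pulls_wanted):
    """Toy deterministic arm: pull during the first `pulls_wanted` local steps."""
    return {
        "state": 0,
        "policy": lambda s, l, rng: "pull" if l < pulls_wanted else IDLE,
        "step": lambda s, a, rng: s + 1,
        "reward": lambda s, a: 1.0,
    }


total, discarded = packing_heuristic(
    [make_arm(2) for _ in range(3)], k=1, T=5, rng=random.Random(0)
)
print(total, discarded)
```

In the toy run, each arm wants two pulls, so with one pull per period and a horizon of five the first two arms are exhausted and discarded in rank order and the third arm receives the final pull. A discarded arm is never returned to the active set, which is the irrevocability property.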

It is clear that the heuristic we have constructed entails a minimal amount of arm 'exploration'. 
In particular, we are guaranteed at most n — k changes to the set of pulled arms. One may naturally 
ask what the limited exploration permitted under this policy costs us in terms of performance. In 
addition, is this scheme computationally practical? In particular, the linear programming relaxation 
we must solve is still a fairly large program. In subsequent sections we address these issues. First, we 
present a theoretical analysis that demonstrates that the price of irrevocability is uniformly bounded 
for an important general class of bandits. Our analysis sheds light on the structural properties that 
are likely to afford a low price of irrevocability in practice. We then present results of computational 
experiments with a generative family of large-scale problems demonstrating performance losses of 
up to 5-10% relative to an upper bound on the performance of the optimal policy 
(which is potentially non-irrevocable and has no restrictions on exploration). Finally, we address 
computational issues relevant to the packing heuristic and develop a computational scheme that is 
substantially quicker than heuristics such as Whittle's heuristic. 



4 The Price of Irrevocability 

This section establishes a uniform bound on the performance loss incurred in using the irrevocable 
packing heuristic relative to an optimal, potentially non-irrevocable scheme for a useful family 
of bandits whose arms exhibit a certain decreasing returns property. This class includes bandits 
whose arms are coins of unknown biases - a family particularly relevant to a number of applications 
including those discussed in the introduction. We establish that the packing heuristic always earns 
expected rewards that are within a factor of 1/8 of an optimal scheme. Our analysis sheds light 
on those structural properties that likely afford a low price to irrevocability. In addition to being 
an indicator of robustness across all parameter regimes, this bound on the price of irrevocability 
is remarkable for two reasons. First, it does not rely on an asymptotic scaling of the system; the 
performance of the packing heuristic will 'track' that of an optimal, potentially non-irrevocable 
heuristic across all regimes. Second, the bound represents a comparison with a system where one 
is allowed recourse to arms that were pulled in the past and discarded. In particular, the bound 
thus highlights the fact that for a useful class of bandits, one may achieve reasonable performance 
with very limited exploration. The typical performance we expect from the heuristic is likely to 
be far superior (as it generally is in the case of problems for which such worst case guarantees can 
be established); in a subsequent section we will present computational experiments indicating a 
performance loss of 5 — 10% relative to an optimal policy with no restrictions on exploration. 

In what follows we first specify the decreasing returns property and explicitly identify a class of 
bandits that possess this property. We then present our performance analysis which will proceed 
as follows: we first consider pulling bandit arms serially, i.e. at most one arm at a time, in order 
of their rank and show that the total reward earned from bandits that were first pulled within the 
first kT/2 pulls is at least within a factor of 1/8 of an optimal policy. This result relies on the static 
ranking of bandit arms used, and a symmetrization idea exploited by Dean et al. (2004) in their 
result on stochastic packing where rewards are statistically independent of item size. In contrast 
to that work, we must address the fact that the rewards earned from a bandit are statistically 
dependent on the number of pulls of that bandit and to this end we exploit the decreasing returns 
property that establishes the nature of this correlation. We then show via a combinatorial sample 
path argument that the expected reward earned from bandits pulled within the first T/2 time steps 
of the packing heuristic i.e., with arms being pulled in parallel, is at least as much as that earned in 
the setting above where arms are pulled serially, thereby establishing our performance guarantee. 

4.1 The Decreasing Returns Property 

Define for every $i$ and $l < T$, the random variable 

$$L_i(l) = \sum_{t=0}^{l} \mathbf{1}_{\{\mu^R_i(s_i^t, t) \neq \phi_i\}}.$$

$L_i(l)$ tracks the number of times a given arm $i$ has been pulled under policy $\mu^R$ among the first 
$l + 1$ steps of selecting an action for that arm. Further, define 

$$R_i^m = \sum_{l=0}^{T-1} \mathbf{1}_{\{L_i(l) \leq m\}} \, r_i\big(s_i^l, \mu^R_i(s_i^l, l)\big).$$

$R_i^m$ is the random reward earned within the first $m$ pulls of arm $i$ under the policy $\mu^R$. The 
decreasing returns property roughly states that the expected incremental returns from allowing an 
additional pull of a bandit arm are, on average, decreasing. More precisely, we have: 

Property 1. (Decreasing Returns) $E[R_i^{m+1}] - E[R_i^m] \leq E[R_i^m] - E[R_i^{m-1}]$ for all $0 < m < T$. 

One useful class of bandits from a modeling perspective that satisfy this property are bandits 
whose arms are 'coins' of unknown bias. The following discussion makes this notion more precise: 

4.1.1 An example of a bandit with decreasing returns: Coins 

We define a 'coin' to be any multi-armed bandit for which every arm $i$ has action space $A_i = \{p, \phi_i\}$, 
with $r(s_i, p) \geq 0$ for all $s_i \in S_i$, and satisfies the following property: 

$$r(s_i, p) \geq \sum_{s_i' \in S_i} P(s_i, p, s_i') \, r(s_i', p), \quad \forall s_i \in S_i.$$

The above super-martingale characterization of rewards intuitively suggests the decreasing returns 
property. In particular, it suggests that the returns from a pull in the current state are at least 
as large as the expected returns to a pull in a state reached subsequent to the current pull. The 
decreasing returns property for coins is established in the following Lemma whose proof may be 
found in the appendix: 

Lemma 2. Coins satisfy the decreasing returns property. That is, if $A_i = \{p, \phi_i\}$ $\forall i$, and 

$$r(s_i, p) \geq \sum_{s_i' \in S_i} P(s_i, p, s_i') \, r(s_i', p), \quad \forall i, \ s_i \in S_i,$$

then 

$$E[R_i^{m+1}] - E[R_i^m] \leq E[R_i^m] - E[R_i^{m-1}]$$

for all $0 < m < T$. 

Returning to our motivating example of dynamic product assortment selection, we note that in 
estimating the bias of a binomial coin of unknown bias given some initial prior on coin bias, Bayes' 
rule implies that the estimated bias after $n$ observations (which generate the filtration $\mathcal{F}_n$), $\mu_{n+1}$, 
satisfies $E[\mu_{n+1} \mid \mathcal{F}_n] = \mu_n$. Thus, bandits with such arms wherein the reward from an arm is some 
non-negative scalar times the bias, automatically possess the decreasing returns property. 
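
The martingale identity for the estimated bias can be checked numerically. The sketch below assumes, purely for illustration, a Beta prior with integer parameters and a single Bernoulli toss per observation; the function names `posterior_mean` and `expected_next_mean` are hypothetical. Exact rational arithmetic is used so that the equality is not obscured by rounding.

```python
from fractions import Fraction


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) belief about the coin bias."""
    return Fraction(alpha, alpha + beta)


def expected_next_mean(alpha, beta):
    """Expected posterior mean after observing one more Bernoulli toss."""
    p_head = posterior_mean(alpha, beta)  # predictive probability of a head
    return (p_head * posterior_mean(alpha + 1, beta)
            + (1 - p_head) * posterior_mean(alpha, beta + 1))


for a, b in [(1, 1), (2, 5), (7, 3)]:
    # The two printed fractions coincide for every (a, b).
    print(a, b, posterior_mean(a, b), expected_next_mean(a, b))
```

Since the current estimate equals the expected next estimate, the condition defining a coin holds with equality when the reward is a non-negative multiple of the estimated bias.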

4.2 A Uniform Bound on the Price of Irrevocability for Bandits with 'Decreasing Returns' 

For convenience of exposition we assume that T is even; addressing the odd case requires essentially 
identical proofs but cumbersome notation. 

We re-order the bandits in decreasing order of $E[R_i]/E[T_i]$ as in the packing heuristic. Let us 
define 

$$H^* = \min\Big\{ j : \sum_{i=1}^{j} E[T_i] \geq kT/2 \Big\}.$$

Thus, bandits $1, \ldots, H^*$ are those that take up approximately half the budget on total expected pulls. 
Next, let us define for all $i \leq H^*$, random variables $\tilde{R}_i$ and $\tilde{T}_i$ according to $\tilde{R}_i = R_i$, $\tilde{T}_i = T_i$ for all $i < H^*$. 
We define $\tilde{R}_{H^*} = \alpha R_{H^*}$ and $\tilde{T}_{H^*} = \alpha T_{H^*}$, where 

$$\alpha = \frac{kT/2 - \sum_{i=1}^{H^*-1} E[T_i]}{E[T_{H^*}]}.$$

We begin with a preliminary lemma: 

Lemma 3. 

$$\sum_{i=1}^{H^*} E[\tilde{R}_i] \geq \frac{1}{2} OPT(RLP(\pi_0)).$$

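Proof. Define a function 

$$f(t) = \sum_{i=1}^{n} \frac{E[R_i]}{E[T_i]} \left( \Big( t - \sum_{j=1}^{i-1} E[T_j] \Big)^{+} \wedge E[T_i] \right),$$

where $(a \wedge b) = \min(a, b)$ and $(x)^+ = \max(x, 0)$. By construction (i.e. since $E[R_i]/E[T_i]$ is non-increasing in $i$), we have that $f$ 
is a concave function on $[0, kT]$. Now observe that 

$$f(kT/2) = \sum_{i=1}^{H^*} E[\tilde{R}_i].$$

Next, observe that 

$$OPT(RLP(\pi_0)) = \sum_{i} \frac{E[R_i]}{E[T_i]} E[T_i] = f(kT).$$

By the concavity of $f$ and since $f(0) = 0$, we have that $f(kT/2) \geq \frac{1}{2} f(kT)$, which yields the 
result. □ 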
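
The concavity argument can be illustrated with a small numerical sketch. It assumes that the function f in the proof is the piecewise-linear reward obtained by filling a pull budget t in rank order, with arm i contributing at rate E[R_i]/E[T_i] for at most E[T_i] pulls. The values in `ER` and `ET` are hypothetical, and `budget` plays the role of kT.

```python
def f(t, ER, ET):
    """Expected reward from filling a pull budget t in rank order (fractionally)."""
    total, used = 0.0, 0.0
    for r, pulls in zip(ER, ET):
        take = min(max(t - used, 0.0), pulls)  # pulls of this arm that fit in the budget
        total += (r / pulls) * take
        used += pulls
    return total


ER = [6.0, 3.0, 1.0]  # hypothetical E[R_i], ranked by E[R_i]/E[T_i]
ET = [2.0, 2.0, 2.0]  # hypothetical E[T_i]
budget = sum(ET)      # plays the role of kT

half, full = f(budget / 2, ER, ET), f(budget, ER, ET)
print(half, full)
```

Because the slopes are non-increasing, the first half of the budget captures at least half of the total reward; in this instance it captures three quarters.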
We next compare the expected reward earned by a certain subset of bandits with indices no 
larger than $H^*$. The significance of the subset of bandits we define will be seen later in the proof 
of Lemma 6: we will see there that all bandits in this subset will begin operation prior to time 
$T/2$ in a run of the packing heuristic. In particular, define 



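$$R_{1/2} = \sum_{i=1}^{H^*} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}} R_i.$$

Lemma 4. 

$$E[R_{1/2}] \geq \frac{1}{4} OPT(RLP(\pi_0)).$$

Proof. We have: 

$$\begin{aligned}
E[R_{1/2}] &\overset{(a)}{=} \sum_{i=1}^{H^*} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big) E[R_i] \\
&\overset{(b)}{\geq} \sum_{i=1}^{H^*} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big) E[\tilde{R}_i] \\
&\overset{(c)}{=} \sum_{i=1}^{H^*} \Pr\Big( \sum_{j=1}^{i-1} \tilde{T}_j < kT/2 \Big) E[\tilde{R}_i] \\
&\overset{(d)}{\geq} \sum_{i=1}^{H^*} \Big( 1 - \frac{2}{kT} \sum_{j=1}^{i-1} E[\tilde{T}_j] \Big) E[\tilde{R}_i] \\
&\overset{(e)}{\geq} \sum_{i=1}^{H^*} \Big( 1 - \frac{1}{kT} \sum_{j=1, j \neq i}^{H^*} E[\tilde{T}_j] \Big) E[\tilde{R}_i] \\
&\overset{(f)}{\geq} \frac{1}{2} \sum_{i=1}^{H^*} E[\tilde{R}_i] \\
&\overset{(g)}{\geq} \frac{1}{4} OPT(RLP(\pi_0)).
\end{aligned}$$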

Equality (a) follows from the fact that under policy $\mu^R$, $R_i$ is independent of $T_j$ for $j < i$. 
Inequality (b) follows from our definition of $\tilde{R}_i$: $\tilde{R}_i \leq R_i$. Equality (c) follows from the fact that by 
definition $\tilde{T}_i = T_i$ for all $i < H^*$. Inequality (d) invokes Markov's inequality. 

Inequality (e) is the critical step in establishing the result and uses the simple symmetrization 
idea exploited by Dean et al. (2004): In particular, we observe that since $E[\tilde{R}_i]/E[\tilde{T}_i] \leq E[\tilde{R}_j]/E[\tilde{T}_j]$ for $i > j$, 
it follows that $E[\tilde{R}_i]E[\tilde{T}_j] \leq \frac{1}{2}(E[\tilde{R}_i]E[\tilde{T}_j] + E[\tilde{R}_j]E[\tilde{T}_i])$ for $i > j$. Replacing every term 
of the form $E[\tilde{R}_i]E[\tilde{T}_j]$ (with $i > j$) in the expression preceding inequality (e) with the upper 
bound $\frac{1}{2}(E[\tilde{R}_i]E[\tilde{T}_j] + E[\tilde{R}_j]E[\tilde{T}_i])$ yields inequality (e). Inequality (f) follows from the fact that 
$\sum_{i=1}^{H^*} E[\tilde{T}_i] = kT/2$. Inequality (g) follows from Lemma 3. □ 
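
The combined effect of the Markov and symmetrization steps can be checked on a small instance. The sketch below assumes arms ranked by reward per pull and a budget B equal to the sum of the expected pulls, which plays the role of kT/2. It evaluates the Markov-type lower bound in which arm i is weighted by one minus the fraction of B consumed by higher ranked arms, and compares it with half of the total expected reward. The function name `markov_bound` and the numbers are hypothetical.

```python
from fractions import Fraction


def markov_bound(ER, ET):
    """Sum over i of (1 - sum_{j<i} ET[j] / B) * ER[i], with B = sum(ET)."""
    budget = sum(ET)
    bound, used = Fraction(0), Fraction(0)
    for r, t in zip(ER, ET):
        bound += (1 - used / budget) * r
        used += t
    return bound


ER = [Fraction(6), Fraction(3), Fraction(1)]  # hypothetical expected rewards
ET = [Fraction(2), Fraction(2), Fraction(2)]  # hypothetical expected pulls
print(markov_bound(ER, ET), sum(ER) / 2)
```

The bound dominates half of the total expected reward whenever the ratios ER[i]/ET[i] are non-increasing in i, which is exactly what the static ranking guarantees.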
Before moving on to our main Lemma that translates the above guarantees to a guarantee 
on the performance of the packing heuristic, we need to establish one additional technical fact. 
Recall that $R_i^m$ is the reward earned by bandit $i$ in the first $m$ pulls of this bandit. Exploiting the 
assumed decreasing returns property, we have the following Lemma whose proof may be found in 
the appendix: 

Lemma 5. For bandits satisfying the decreasing returns property (Property 1), 

$$E\left[ \sum_{i=1}^{H^*} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}} R_i^{T/2} \right] \geq \frac{1}{2} E[R_{1/2}].$$




We have thus far established estimates for total expected rewards earned assuming implicitly 
that bandits are pulled in a serial fashion in order of their rank. The following Lemma connects 
these estimates to the expected reward earned under the $\mu^{packing}$ policy (given by the packing 
heuristic) using a simple sample path argument. In particular, the following Lemma shows that 
the expected rewards under the $\mu^{packing}$ policy are at least as large as 
$E\big[ \sum_{i=1}^{H^*} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}} R_i^{T/2} \big]$. 

Lemma 6. 

$$E[J^{\mu^{packing}}(s, 0) \mid s \sim \pi_0] \geq E\left[ \sum_{i=1}^{H^*} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}} R_i^{T/2} \right].$$


Proof. For a given sample path of the system define 

$$h = (H^*) \wedge \min\Big\{ i : \sum_{j=1}^{i} T_j \geq kT/2 \Big\}.$$

On this sample path, it must be that: 

$$\sum_{i=1}^{H^*} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}} R_i^{T/2} = \sum_{i=1}^{h} R_i^{T/2}. \qquad (4.1)$$

We claim that arms $1, 2, \ldots, h$ are all first pulled at times $t < T/2$ under $\mu^{packing}$. Assume to 
the contrary that this were not the case and recall that arms are considered in order of index under 
$\mu^{packing}$, so that an arm with index $i$ is pulled for the first time no later than the first time arm $l$ 
is pulled for $l > i$. Let $h'$ be the highest arm index among the arms pulled at time $t = T/2 - 1$, so 
that $h' < h$. It must be that $\sum_{j=1}^{h'} T_j \geq kT/2$. But then, 

$$h = H^* \wedge \min\Big\{ i : \sum_{j=1}^{i} T_j \geq kT/2 \Big\} \leq h',$$

which is a contradiction. 

Thus, since every one of the arms $1, 2, \ldots, h$ is first pulled at times $t < T/2$, each such arm may 
be pulled for at least $T/2$ time steps prior to time $T$ (the horizon). Consequently, we have that the 
total rewards earned on this sample path under policy $\mu^{packing}$ are at least 

$$\sum_{i=1}^{h} R_i^{T/2}.$$

Using identity (4.1) and taking an expectation over sample paths yields the result. □ 



We are ready to establish our main Theorem that provides a uniform bound on the performance 
loss incurred in using the packing heuristic policy relative to an optimal policy with no restrictions 
on exploration. In particular, we have that the price of irrevocability is uniformly bounded for 
bandits satisfying the decreasing returns property. 

Theorem 1. For multi-armed bandits satisfying the decreasing returns property (Property 1), 
$E[J^{\mu^{packing}}(s, 0) \mid s \sim \pi_0] \geq \frac{1}{8} E[J^*(s, 0) \mid s \sim \pi_0]$ for all initial state distributions $\pi_0$. 

Proof. We have from Lemmas 3, 4, 5 and 6 that 

$$E[J^{\mu^{packing}}(s, 0) \mid s \sim \pi_0] \geq \frac{1}{8} OPT(RLP(\pi_0)).$$

We know from Lemma 1 that $OPT(RLP(\pi_0)) \geq OPT(LP(\pi_0)) = E[J^*(s, 0) \mid s \sim \pi_0]$ from which 
the result follows. □ 

Our analysis highlighted a structural property - decreasing returns - that is likely to afford 
a low price of irrevocability. The next section demonstrates computational results that suggest 
that in practice we may expect this price to be quite small (on the order of 5 — 10%) for bandits 
possessing this property. 



5 Computational Experiments 

This section presents computational experiments with the packing heuristic. We consider a number 
of large scale bandit problems drawn from a generative family of problems to be discussed shortly 
and demonstrate that the packing heuristic consistently demonstrates performance within about 
5 — 10 % of an upper bound on the performance of an unrestricted (i.e. potentially non-irrevocable) 
optimal solution to the multi-armed bandit problem. In particular, this suggests that the price of 
irrevocability is likely to be small in practice, at least for models of the type we consider here. Since 
the bandits considered in our experiments - Binomial coins of uncertain bias - are among the most 
widely used applications of the multi-armed bandit model, we view this to be a positive result. 

The Generative Model: We consider multi-armed bandit problems with $n$ arms up to $k$ of 
which may be pulled simultaneously at any time. The $i$th arm corresponds to a Binomial($m$, $P_i$) coin 
where $m$ is fixed and known, and $P_i$ is unknown but drawn from a Dirichlet($\alpha_i$, $\beta_i$) prior distribution 
(that is, a Beta distribution with parameters $\alpha_i$ and $\beta_i$). 
Assuming we choose to 'pull' arm $i$ at some point, we realize a random outcome $M_i \in \{0, 1, \ldots, m\}$. 
$M_i$ is a Binomial($m$, $P_i$) random variable where $P_i$ is itself a Dirichlet($\alpha_i$, $\beta_i$) random variable. We 
receive a reward of $r_i M_i$ and update the prior distribution parameters according to $\alpha_i \leftarrow \alpha_i + M_i$, 
$\beta_i \leftarrow \beta_i + m - M_i$. By selecting the initial values of $\alpha_i$ and $\beta_i$ for each arm appropriately we can 
control for the initial uncertainty in the value of $P_i$. This model is, for instance, applicable to the 
dynamic assortment selection problem discussed earlier (see Caro and Gallien (2007)) with each 
coin representing a product of uncertain popularity and $M_i$ representing the uncertain number of 
product $i$ sales over a single period in which that product is offered for sale. We recall from our 
previous discussion that this family of bandits satisfies the decreasing returns property and from 
our performance analysis we expect a reasonable price of irrevocability. 

Summary of Computational Experiments 

Coeff. of Variation (cv)   Arms (n)   Simultaneous Pulls (k)   Horizon (T)   Performance ($J^{\mu^{packing}}/J^*$) 
Moderate (1)               500        50                       25            0.91 
Moderate (1)               500        50                       40            0.92 
Moderate (1)               500        100                      25            0.93 
Moderate (1)               500        100                      40            0.94 
Moderate (1)               100        10                       25            0.88 
Moderate (1)               100        10                       40            0.89 
Moderate (1)               100        20                       25            0.91 
Moderate (1)               100        20                       40            0.93 
High (2.5)                 500        50                       25            0.89 
High (2.5)                 500        50                       40            0.90 
High (2.5)                 500        100                      25            0.90 
High (2.5)                 500        100                      40            0.91 
High (2.5)                 100        10                       25            0.87 
High (2.5)                 100        10                       40            0.88 
High (2.5)                 100        20                       25            0.90 
High (2.5)                 100        20                       40            0.91 

Table 1: Computational Summary. Each row represents summary statistics for 100 distinct random 
bandit problems with the specified n, k, T and cv parameters. Performance for each instance 
was computed from 3000 simulations of that instance. Performance figures thus represent an 
average over the generative family with the specified n, k, T and cv parameters as well as over system 
randomness. 

We consider the following random instances of the above problem. We consider bandits with 
$(n, k) \in \{(500, 50), (500, 100), (100, 10), (100, 20)\}$. These dimensions are representative of large 
scale applications of which the dynamic assortment problem is an example. For each value of $(n, k)$ 
we consider time horizons $T = 25$ and $T = 40$. For every bandit problem we consider, we subdivide 
the arms of the bandit into 10 groups. All arms within a group have identical statistical structure, 
that is, identical $r_i$ values and identical initial values of $\alpha_i$ and $\beta_i$. For each value of $(n, k, T)$, we 
generate a number of problem instances by randomly drawing prior parameters for bandit arms. 
In particular, for all arms in a given group we select $\alpha_i$ uniformly in the interval [0.05, 0.35] and 
then select that value of $\beta_i$ which results in a prior co-efficient of variation $cv \in \{1, 2.5\}$. These 
co-efficients of variation represent, respectively, a moderate and high degree of a-priori uncertainty 
in coin bias (or in the context of the dynamic assortment application, product popularity). In 
addition, $r_i$ is drawn uniformly on [0, 2] and we take $m = 2$. We generate 100 random problem 
instances for each co-efficient of variation. Control policies for a given bandit problem instance 
are evaluated over 3000 random state trajectories (which resulted in 98% confidence intervals that 
were at least within ±1% of the sample average). 
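
The pull and update rule of the generative model can be sketched as follows. This is an illustrative simulation under stated assumptions, not the code used for the experiments: each pull draws the bias from the current Beta belief (which reproduces the Bayesian predictive distribution of the outcome), the Binomial draw is built from m Bernoulli trials, and the function name `pull`, the dictionary keys and the numerical prior values are hypothetical.

```python
import random


def pull(arm, m, rng):
    """Pull a coin once: sample sales, update the belief, return the reward r_i * M_i."""
    p = rng.betavariate(arm["alpha"], arm["beta"])   # bias drawn from the current belief
    sales = sum(rng.random() < p for _ in range(m))  # M_i ~ Binomial(m, p)
    arm["alpha"] += sales                            # alpha_i <- alpha_i + M_i
    arm["beta"] += m - sales                         # beta_i <- beta_i + m - M_i
    return arm["r"] * sales


rng = random.Random(0)
arm = {"alpha": 0.2, "beta": 1.5, "r": 1.3}  # hypothetical prior parameters and reward scale
rewards = [pull(arm, m=2, rng=rng) for _ in range(5)]
print(rewards, arm)
```

Each pull adds exactly m to the sum of the two belief parameters, so the belief sharpens at a fixed rate regardless of the outcomes observed.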

Evaluating Performance: A striking feature of our performance results is that the price 
of irrevocability is quite small, a trend that appears to hold over varying parameter regimes. In 
particular, we make the following observations: 

• Consider problems with a small number of arms (100) with a large number of simultaneous 
pulls (20) allowed. Intuitively, an optimal policy could reasonably explore all arms in this 
setting before settling on the 'best' arms. We thus expect the price of irrevocability to be 
high here. Even in this regime we find that the price of irrevocability is only about 10-11% 
of optimal performance. 

• Consider problems with a high degree of a-priori uncertainty in coin bias. Mistakes - that 
is, discarding an arm that is performing reasonably in favor of an unexplored arm that turns 
out to perform poorly - are particularly expensive in such problems. With a high co-efficient 
of variation in the prior on initial arm bias, the price of irrevocability is indeed somewhat 
higher but continues to remain within 10-12% of optimal performance. 

• For each of our experiments, we observe that keeping all other parameters fixed, relative 
performance improves with a longer time horizon. This is intuitive; with longer horizons, one 
may delay discarding an arm until one is sure that the arm performs poorly relative to 
the expected value of the available alternatives. 

• Finally, we note that the performance figures we report are relative to an upper bound on 
optimal policy performance. Computing the optimal policy is itself an intractable task. The 
performance observed here suggests that at least for bandit problems with decreasing returns 
the packing heuristic is a viable approximation scheme even when irrevocability is not 
necessarily a concern. 

We can thus conclude that the price of irrevocability is small for a useful class of multi-armed 
bandit problems and that the packing heuristic performs well for this class of problems. A final 
concern is computational effort. In particular, for the largest problem instance we considered 
(n = 500), the linear program we need to solve has 3.2 million variables and about the same number 
of constraints. Even a commercial linear programming solver (such as CPLEX) equipped with the 
ability to exploit structure in this program will require several hours on a powerful computer to solve 
this program. This is in stark contrast with an index based heuristic (such as Whittle's heuristic) 
that solves a simple dynamic program for each arm at every time step. In the next section we develop 
an efficient computational algorithm for the solution of $RLP(\pi_0)$ that requires substantially less 
effort than even Whittle's heuristic and takes a few minutes to solve the aforementioned program 
on a laptop computer. 

6 Fast Computation 

This section considers the computational effort required to implement the packing heuristic. We 
develop a computational scheme that makes the packing heuristic substantially easier to implement 
than popular index heuristics such as Whittle's heuristic and thus establish that the heuristic is 
viable from a computational perspective. 

The key computational step in implementing the packing heuristic is the solution of the linear 
program $RLP(\pi_0)$. Assuming that $|S_i| = O(S)$ and $|A_i| = O(A)$ for all $i$, this linear program 
has $O(nTAS)$ variables and each Newton iteration of a general purpose interior point method will 
require $O((nTAS)^3)$ steps. An interior point method that exploits the fact that bandit arms are 
coupled via a single constraint will require $O(n(TAS)^3)$ computational steps at each iteration. We 
develop a combinatorial scheme to solve this linear program that is in spirit similar to the classical 
Dantzig-Wolfe dual decomposition algorithm. In contrast with Dantzig-Wolfe decomposition, our 
scheme is efficient. In particular, the scheme requires $O(nTAS^2 \log(kT))$ computational steps 
to solve $RLP(\pi_0)$ making it a significantly faster solution alternative to the schemes alluded to 
above. Equipped with this fast scheme, it is notable that using the packing heuristic requires 
$O(nAS^2 \log(kT))$ computations per time step amortized over the time horizon which will typically 
be substantially less than the $\Omega(nAS^2 T)$ computations required per time step for index policy 
heuristics such as Whittle's heuristic. 

Our scheme employs a 'dual decomposition' of $RLP(\pi_0)$. The key technical difficulty we 
must overcome in developing our computational scheme for the solution of $RLP(\pi_0)$ is the 
non-differentiability of the dual function corresponding to $RLP(\pi_0)$ at an optimal dual solution which 
prevents us from recovering an optimal or near optimal policy by direct minimization of the dual 
function. 

6.1 An Overview of the Scheme 

For each bandit arm $i$, define the polytope $D_i(\pi_0) \subset \mathbb{R}^{|S_i||A_i|T}$ of permissible state-action frequencies 
for that bandit arm specified via the constraints of $RLP(\pi_0)$ relevant to that arm. 

A point within this polytope, $\pi_i$, corresponds to a set of valid state-action frequencies for the 
$i$th bandit arm. With some abuse of notation, we denote the expected reward from this arm under 
$\pi_i$ by the 'value' function: 

$$R_i(\pi_i) = \sum_{t=0}^{T-1} \sum_{s_i, a_i} \pi_i(s_i, a_i, t) \, r_i(s_i, a_i).$$

In addition denote the expected number of pulls of bandit arm $i$ under $\pi_i$ by 

$$T_i(\pi_i) = T - \sum_{s_i} \sum_{t} \pi_i(s_i, \phi_i, t).$$

We understand that both $R_i(\cdot)$ and $T_i(\cdot)$ are defined over the domain $D_i(\pi_0)$. 
We may thus rewrite $RLP(\pi_0)$ in the following form: 

$$\begin{array}{ll} \max. & \sum_i R_i(\pi_i), \\ \text{s.t.} & \sum_i T_i(\pi_i) \leq kT. \end{array} \qquad (6.1)$$

The Lagrangian dual of this program is $DRLP(\pi_0)$: 

$$\begin{array}{ll} \min. & \lambda kT + \sum_i \max_{\pi_i} \left( R_i(\pi_i) - \lambda T_i(\pi_i) \right), \\ \text{s.t.} & \lambda \geq 0. \end{array}$$

The above program is convex. In particular, the objective is a convex function of $\lambda$. We will 
show that strong duality applies to the dual pair of programs above, so that the optimal solutions 
to the two programs have identical value. Next, we will observe that for a given value of $\lambda$, it is 
simple to compute $\max_{\pi_i} (R_i(\pi_i) - \lambda T_i(\pi_i))$ via the solution of a dynamic program over the state 
space of arm $i$ (a fast procedure). Finally it is simple to derive useful a-priori lower and upper 
bounds on the optimal dual solution $\lambda^*$. Thus, in order to solve the dual program, one may simply 
employ a bisection search over $\lambda$. Since for a given value of $\lambda$, the objective may be evaluated 
via the solution of $n$ simple dynamic programs, the overall procedure of solving the dual program 
$DRLP(\pi_0)$ is fast. 

What we ultimately require is the optimal solution to the primal program $RLP(\pi_0)$. One 
natural way we might hope to do this (that ultimately will not work) is the following: Having 
computed an optimal dual solution $\lambda^*$, one may hope to recover an optimal primal solution, $\pi^*$, 
via the solution of the problem 

$$\max_{\pi_i} \left( R_i(\pi_i) - \lambda^* T_i(\pi_i) \right) \qquad (6.2)$$

for each $i$. This is the typical dual decomposition procedure. Unfortunately, this last step need not 
necessarily yield a feasible solution to $RLP(\pi_0)$. In particular, solving (6.2) for $\lambda = \lambda^* + \epsilon$ may 
result in an arbitrarily suboptimal solution for any $\epsilon > 0$, while solving (6.2) for a $\lambda < \lambda^*$ may 
yield an infeasible solution to $RLP(\pi_0)$. The technical reason for this is that the Lagrangian dual 
function for $RLP(\pi_0)$ may be non-differentiable at $\lambda^*$. These difficulties are far from pathological, 
and Example 1 illustrates how they may arise in a very simple example. 

Example 1. The following example illustrates that the dual function may be non-differentiable at 
an optimal solution, and that it is not sufficient to solve (6.2) for $\lambda < \lambda^*$ or $\lambda = \lambda^* + \epsilon$ for an $\epsilon > 0$ 
arbitrarily small. Specifically, consider the case where we have $n = 2$ identical bandits, $T = 1$, and 
$k = 1$. Each bandit starts in state $s$, and two actions can be chosen for it, namely, $a$ and the 
idling action $\phi$. The rewards are $r(s, a) = 1$ and $r(s, \phi) = 0$. Thus, $RLP(\pi_0)$ for this specific case 
is given by: 

$$\begin{array}{ll} \max. & \pi_1(s, a, 0) + \pi_2(s, a, 0), \\ \text{s.t.} & \pi_1(s, a, 0) + \pi_2(s, a, 0) \leq 1, \end{array}$$

where $\pi_i \in D_i(\pi_0)$, $i = 1, 2$. Clearly, the optimal objective function value for the above optimization 
problem is 1. The Lagrangian dual function for the above problem is 

$$g(\lambda) = \lambda + \max_{\pi_1(s,a,0)} \pi_1(s, a, 0)(1 - \lambda) + \max_{\pi_2(s,a,0)} \pi_2(s, a, 0)(1 - \lambda) = \begin{cases} 2 - \lambda, & \lambda \leq 1, \\ \lambda, & \lambda \geq 1. \end{cases}$$

Note that the dual function is minimized at $\lambda^* = 1$, which is a point of non-differentiability. Moreover, 
solving (6.2) at $\lambda^* + \epsilon$ for any $\epsilon > 0$ gives $\pi_1(s, a, 0) = \pi_2(s, a, 0) = 0$, which is clearly suboptimal. 
Also, a solution for $0 \leq \lambda < \lambda^*$ is $\pi_1(s, a, 0) = \pi_2(s, a, 0) = 1$, which is clearly infeasible. 



Notice that in the above example, the average of the solutions to problem (6.2) for $\lambda = \lambda^* - \epsilon$ and 
$\lambda = \lambda^* + \epsilon$ does yield a feasible, optimal primal solution, $\pi_1(s, a, 0) = \pi_2(s, a, 0) = 1/2$. We overcome 
the difficulties presented by the non-differentiability of the dual function by computing both upper 
and lower approximations to $\lambda^*$, and computing solutions to (6.2) for both of these approximations. 
We then consider as our candidate solution to $RLP(\pi_0)$, a certain convex combination of the two 
solutions. In particular, we propose Algorithm 2 that takes as input the specification of the bandit 
and a tolerance parameter $\epsilon$. The algorithm produces a feasible solution to $RLP(\pi_0)$ that is within 
an additive factor of $2\epsilon$ of optimal. 



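Algorithm 2 RLP SOLVER 

 1: $\lambda^{feas} \leftarrow r_{\max} + \delta$, for any $\delta > 0$, $\lambda^{infeas} \leftarrow 0$. 
 2: For all $i$, $\pi_i^{feas} \leftarrow \pi_i \in \arg\max_{\pi_i} (R_i(\pi_i) - \lambda^{feas} T_i(\pi_i))$, 
    $\pi_i^{infeas} \leftarrow \pi_i \in \arg\max_{\pi_i} (R_i(\pi_i) - \lambda^{infeas} T_i(\pi_i))$. 
 3: while $\lambda^{feas} - \lambda^{infeas} > \epsilon/(kT)$ do 
 4:   $\lambda \leftarrow (\lambda^{feas} + \lambda^{infeas})/2$ 
 5:   for $i = 1$ to $n$ do 
 6:     $\pi_i^* \leftarrow \pi_i \in \arg\max_{\pi_i} (R_i(\pi_i) - \lambda T_i(\pi_i))$. 
 7:   end for 
 8:   if $\sum_{i=1}^{n} T_i(\pi_i^*) > kT$ then 
 9:     $\lambda^{infeas} \leftarrow \lambda$, $\pi_i^{infeas} \leftarrow \pi_i^*$, $\forall i$ 
10:   else 
11:     $\lambda^{feas} \leftarrow \lambda$, $\pi_i^{feas} \leftarrow \pi_i^*$, $\forall i$ 
12:   end if 
13: end while 
14: if $\sum_i \big( T_i(\pi_i^{infeas}) - T_i(\pi_i^{feas}) \big) > 0$ then 
15:   $\alpha \leftarrow \dfrac{kT - \sum_i T_i(\pi_i^{feas})}{\sum_i \big( T_i(\pi_i^{infeas}) - T_i(\pi_i^{feas}) \big)}$ 
16: else 
17:   $\alpha \leftarrow 0$ 
18: end if 
19: for $i = 1$ to $n$ do 
20:   $\pi_i^{RLP} \leftarrow \alpha \pi_i^{infeas} + (1 - \alpha) \pi_i^{feas}$ 
21: end for 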



It is clear that the bisection search above will require $O(\log(r_{\max} kT/\epsilon))$ steps (where 
$r_{\max} = \max_{i, s_i, a_i} r(s_i, a_i)$). At each step in this search, we solve $n$ problems of the type in (6.2), i.e. 
$\max_{\pi_i} (R_i(\pi_i) - \lambda T_i(\pi_i))$. These subproblems may be reduced to a dynamic program over the 
state space of a single arm. In particular, we define a reward function $\hat{r}_i : S_i \times A_i \to \mathbb{R}$ according 
to $\hat{r}_i(s_i, a_i) = r_i(s_i, a_i) - \lambda \mathbf{1}_{\{a_i \neq \phi_i\}}$ and compute the value of an optimal policy starting at state $s_{0,i}$ 
(where $s_0$ is that state on which $\pi_0$ places mass 1) assuming $\hat{r}_i$ as the reward function. This requires 
$O(S^2 AT)$ steps per arm. Thus the RLP Solver algorithm requires a total of $O(nS^2 AT \log(r_{\max} kT/\epsilon))$ 
computational steps prior to termination. The following theorem, proved in the appendix, 
establishes the quality of the solution produced by the RLP Solver algorithm: 

Theorem 2. RLP Solver produces a feasible solution to $RLP(\pi_0)$ of value at least 
$OPT(RLP(\pi_0)) - 2\epsilon$. 

The RLP Solver scheme was used for all computational experiments in the previous section. 
Using this scheme, the largest problem instances we considered were solved in a few minutes on a 
laptop computer. 
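
The bisection scheme of this section can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions rather than the implementation used in the experiments: every arm has only a pull action and an idle action, idling leaves the state unchanged and earns nothing, ties in the per-arm dynamic program are broken toward idling, and each arm's solution is summarized by its expected reward and expected number of pulls instead of full state-action frequencies. The names `solve_arm` and `rlp_solver`, the dictionary keys `r`, `P` and `s0`, and the clipping of the mixing weight to at most 1 are illustrative additions.

```python
def solve_arm(r, P, s0, T, lam):
    """Backward induction for one arm under the penalized reward r - lam per pull.

    r[s] is the pull reward in state s and P[s][u] the pull transition probability.
    Returns (expected reward, expected pulls) of a maximizing policy from state s0.
    """
    S = len(r)
    V, R, N = [0.0] * S, [0.0] * S, [0.0] * S  # value, reward and pulls to go
    for _ in range(T):
        V2, R2, N2 = V[:], R[:], N[:]          # idling keeps the state and its values
        for s in range(S):
            pull_value = r[s] - lam + sum(P[s][u] * V[u] for u in range(S))
            if pull_value > V[s]:              # strict comparison: ties favor idling
                V2[s] = pull_value
                R2[s] = r[s] + sum(P[s][u] * R[u] for u in range(S))
                N2[s] = 1.0 + sum(P[s][u] * N[u] for u in range(S))
        V, R, N = V2, R2, N2
    return R[s0], N[s0]


def rlp_solver(arms, k, T, eps, delta=1e-6):
    """Bisection on the multiplier; returns (value, expected pulls, mixing weight)."""
    def solve(lam):
        return [solve_arm(a["r"], a["P"], a["s0"], T, lam) for a in arms]

    lam_feas = max(max(a["r"]) for a in arms) + delta
    lam_infeas = 0.0
    feas, infeas = solve(lam_feas), solve(lam_infeas)
    while lam_feas - lam_infeas > eps / (k * T):
        lam = 0.5 * (lam_feas + lam_infeas)
        sol = solve(lam)
        if sum(pulls for _, pulls in sol) > k * T:
            lam_infeas, infeas = lam, sol
        else:
            lam_feas, feas = lam, sol
    t_feas = sum(pulls for _, pulls in feas)
    t_infeas = sum(pulls for _, pulls in infeas)
    alpha = (k * T - t_feas) / (t_infeas - t_feas) if t_infeas > t_feas else 0.0
    alpha = min(alpha, 1.0)  # added safeguard, not part of the listing
    value = sum(alpha * ri + (1 - alpha) * rf for (ri, _), (rf, _) in zip(infeas, feas))
    return value, alpha * t_infeas + (1 - alpha) * t_feas, alpha


# Example 1: two identical single-state arms, T = 1, k = 1.
arms = [{"r": [1.0], "P": [[1.0]], "s0": 0} for _ in range(2)]
print(rlp_solver(arms, k=1, T=1, eps=1e-3))
```

On the instance of Example 1 the two bracketing solutions pull both arms or neither, and the convex combination with weight 1/2 recovers the feasible optimal value of 1, mirroring the averaging argument made for that example.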
7 Concluding Remarks 



This paper introduced 'irrevocable' policies for the multi-armed bandit problem. We hope to 
draw two main conclusions from our presentation thus far. First, in addition to being a desirable 
constraint to impose on a multi-armed bandit policy, irrevocability is frequently a cheap and thereby 
practical constraint. In particular, we have attempted to show via a host of computational results 
as also a theoretical analysis that the price of irrevocability, i.e. the performance loss incurred 
relative to an optimal scheme with no restrictions on exploration, is likely to be small in practice. 
In fact, the performance of the heuristic we developed suggests that it is likely to be useful even 
when irrevocability is not a concern. Second, we have shown that computing good irrevocable 
policies for multi-armed bandit problems is easy. In particular, we developed a fast computational 
scheme to accomplish this task. This scheme is faster than a widely used heuristic for general 
multi-armed bandit problems. 

This research serves as a point of departure for a number of questions that we believe 
would be interesting to explore: 

• The packing heuristic is one of many possible irrevocable heuristics. It is attractive since it 
offers satisfactory performance in computational experiments, affords a worst-case price of 
irrevocability analysis (and so is theoretically robust), and finally can be implemented with 
less computational effort than typical index heuristics. The packing heuristic is, however, 
by no means the only irrevocable heuristic one may construct. An alternative, for instance, 
would be to modify Whittle's heuristic so as to simply ignore bandits that were discarded in 
the past or equivalently formulate a corresponding restless bandit problem where discarded 
arms generate no rewards. The present work establishes irrevocability as a cheap constraint 
to place on many multi-armed bandit problems. Moving forward, one may hope to construct 
other high performing irrevocable heuristics for the multi-armed bandit problem. 

• What happens to the price of irrevocability in interesting asymptotic parameter regimes? An 
example of such a regime may include for instance, simultaneously scaling the number of 
bandits, the number of simultaneous plays allowed, k, as also the horizon T. The correlation 
in the reward earned from a bandit arm and the length of time the arm is pulled preclude a 
useful, straightforward large deviations type extension to our analysis. Nonetheless, this is 
an important question to ask. 

• We have illustrated the decreasing returns property for coins. This property is fairly natural 
though, and there are a number of other types of bandits that may well possess this property. 
Staying in the vein of dynamic assortment selection, it may well be the case that 
incorporating inventory and pricing decisions for each product still yields bandits satisfying this 
property. 

• In addition to bandits satisfying the decreasing returns property, are there other interesting 
classes of bandits that afford a low price of irrevocability? 

• Is irrevocability a cheap constraint for interesting classes of restless bandit problems? Given 
our results, it is intuitive to expect that this may be the case for bandits with switching costs, 
which represent one simple class of restless bandit problems. 






Our work thus far suggests that irrevocable policies are an effective means of extending the 
practical applicability of multi-armed bandit approaches in several interesting scenarios, such as 
dynamic assortment problems or sequential drug trials where recourse to drugs that were discontinued 
from trials at some point in the past is socially unacceptable. Progress on the questions above will 
likely further the goal of extending the practical applicability of the multi-armed bandit approach. 
A Proofs for Section 3 

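Lemma 1. $OPT(RLP(\pi_0)) \ge OPT(LP(\pi_0))$. 

Proof. Let $\hat\pi$ be an optimal solution to $LP(\pi_0)$. We construct a feasible solution to $RLP(\pi_0)$ of equal value. 
In particular, define a candidate solution $\bar\pi$ to $RLP(\pi_0)$ according to 

$$\bar\pi_i(s_i, a_i, t) = \sum_{\bar s, \bar a \,:\, \bar s_i = s_i,\ \bar a_i = a_i} \hat\pi(\bar s, \bar a, t).$$

This solution has value precisely $OPT(LP(\pi_0))$. It remains to establish feasibility. For this we first observe 
that 

$$\begin{aligned}
\sum_{s'_i, a'_i} P_i(s'_i, a'_i, s_i)\, \bar\pi_i(s'_i, a'_i, t-1)
&= \sum_{s'_i, a'_i} P_i(s'_i, a'_i, s_i) \sum_{\bar s', \bar a' \,:\, \bar s'_i = s'_i,\ \bar a'_i = a'_i} \hat\pi(\bar s', \bar a', t-1) \\
&= \sum_{s'_i, a'_i} P_i(s'_i, a'_i, s_i) \sum_{\bar s', \bar a' \,:\, \bar s'_i = s'_i,\ \bar a'_i = a'_i} \Big[ \sum_{\bar s_{-i}} \prod_{j \ne i} P_j(\bar s'_j, \bar a'_j, \bar s_j) \Big] \hat\pi(\bar s', \bar a', t-1) \\
&= \sum_{\bar s', \bar a', \bar s \,:\, \bar s_i = s_i} P(\bar s', \bar a', \bar s)\, \hat\pi(\bar s', \bar a', t-1) \\
&= \sum_{\bar s \,:\, \bar s_i = s_i,\ \bar a} \hat\pi(\bar s, \bar a, t) \\
&= \sum_{a_i} \bar\pi_i(s_i, a_i, t).
\end{aligned} \tag{A.1}$$

Next, we observe that the expected total number of arm pulls under the policy prescribed by $\hat\pi$ 
is simply 

$$\sum_t \sum_i \sum_{\bar s, \bar a \,:\, \bar a_i \ne \phi_i} \hat\pi(\bar s, \bar a, t).$$

Since the total number of pulls in a given time step under $\hat\pi$ is at most $k$, we have 

$$\sum_t \sum_i \sum_{\bar s, \bar a \,:\, \bar a_i \ne \phi_i} \hat\pi(\bar s, \bar a, t) \le kT.$$

But 

$$\begin{aligned}
\sum_t \sum_i \sum_{\bar s, \bar a \,:\, \bar a_i \ne \phi_i} \hat\pi(\bar s, \bar a, t)
&= \sum_i \sum_t \sum_{s_i,\ a_i \ne \phi_i} \bar\pi_i(s_i, a_i, t) \\
&= \sum_i \sum_t \Big( 1 - \sum_{s_i} \bar\pi_i(s_i, \phi_i, t) \Big) \\
&= \sum_i \Big( T - \sum_{s_i, t} \bar\pi_i(s_i, \phi_i, t) \Big),
\end{aligned}$$

so that 

$$\sum_i \Big( T - \sum_{s_i, t} \bar\pi_i(s_i, \phi_i, t) \Big) \le kT. \tag{A.2}$$

From (A.1) and (A.2), $\bar\pi$ is indeed a feasible solution to $RLP(\pi_0)$. This completes the proof. □ 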
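The marginalization step in the proof of Lemma 1 can be checked numerically on a toy instance. The sketch below is illustrative only: the two-bandit joint frequencies, the action labels `PULL`/`IDLE` (standing in for $p$ and $\phi_i$), and all function names are assumptions made for the example, not part of the paper. It confirms at a single time step that the expected number of pulls computed from the joint frequencies equals $\sum_i (1 - \text{idle mass of bandit } i)$ computed from the per-bandit marginals, which is the identity behind (A.2).

```python
from fractions import Fraction

IDLE = "idle"  # stands in for the idling action phi_i (illustrative label)
PULL = "pull"  # stands in for the pull action p (illustrative label)


def marginal(joint, i):
    """Per-bandit frequencies: sum the joint mass over everything except bandit i."""
    out = {}
    for (states, actions), prob in joint.items():
        key = (states[i], actions[i])
        out[key] = out.get(key, Fraction(0)) + prob
    return out


def expected_pulls_joint(joint):
    """Expected number of arms pulled in this time step, from the joint frequencies."""
    return sum(prob * sum(a != IDLE for a in actions)
               for (_, actions), prob in joint.items())


def expected_pulls_marginals(joint, n):
    """Same quantity from the marginals: sum over bandits of (1 - idle mass)."""
    total = Fraction(0)
    for i in range(n):
        idle_mass = sum(p for (_, a), p in marginal(joint, i).items() if a == IDLE)
        total += 1 - idle_mass
    return total


# Hypothetical joint state-action frequencies for n = 2 bandits at one time step,
# with at most k = 1 pull in every joint action.
joint = {
    ((0, 0), (PULL, IDLE)): Fraction(1, 2),
    ((0, 1), (IDLE, PULL)): Fraction(1, 4),
    ((1, 1), (IDLE, IDLE)): Fraction(1, 4),
}

print(expected_pulls_joint(joint), expected_pulls_marginals(joint, 2))
```

Exact rational arithmetic is used so that the two quantities can be compared with equality rather than a tolerance.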



B Proofs for Section 4 

Lemma 2. Coins satisfy the decreasing returns property. That is, if $A_i = \{p, \phi_i\}$ for all $i$, and 

$$r(s_i, p) \ge \sum_{s'_i \in S_i} P(s_i, p, s'_i)\, r(s'_i, p), \quad \forall i,\ s_i \in S_i,$$

then 

$$E[R_i^{m+1}] - E[R_i^{m}] \le E[R_i^{m}] - E[R_i^{m-1}]$$

for all $0 < m < T$. 

Proof. To see that coins satisfy the decreasing returns property, we first introduce some notation. It is clear 
that the policy $\mu_i^R$ induces a Markov process on the state space $S_i$. We expand this state space so as to 
track the total number of arm pulls, so that our state space now becomes $S_i \times \{0, 1, \dots, T\}$. The policy $\mu_i^R$ induces 
a distribution over arm $i$ states for every time $t < T$, which we denote by the variable $\pi$. Thus, $\pi(s_i, m, t, a_i)$ 
will denote the probability of being in state $(s_i, m)$ at time $t$ and taking action $a_i$. 
Now, 

$$E[R_i^{m+1} - R_i^{m}] = \sum_{s_i,\ t < T} \pi(s_i, m, t, p)\, r(s_i, p),$$

and similarly for $E[R_i^{m} - R_i^{m-1}]$. 
But, 

$$\begin{aligned}
\sum_{s_i,\ t < T} \pi(s_i, m, t, p)\, r(s_i, p)
&= \sum_{s_i,\ t < T-1} \pi(s_i, m-1, t, p) \Big( \sum_{s'_i} P(s_i, p, s'_i)\, g(s'_i, t+1)\, r(s'_i, p) \Big) \\
&\le \sum_{s_i,\ t < T-1} \pi(s_i, m-1, t, p) \Big( \sum_{s'_i} P(s_i, p, s'_i)\, r(s'_i, p) \Big) \\
&\le \sum_{s_i,\ t < T-1} \pi(s_i, m-1, t, p)\, r(s_i, p) \\
&\le \sum_{s_i,\ t < T} \pi(s_i, m-1, t, p)\, r(s_i, p) \\
&= E[R_i^{m} - R_i^{m-1}],
\end{aligned}$$

where $g(s_i, t) = 1 - \prod_{t'=t}^{T-1} \Pr(\mu_i^R(s_i, t') = \phi_i)$. Here $\prod_{t'=t}^{T-1} \Pr(\mu_i^R(s_i, t') = \phi_i)$ is the probability of never 
pulling the arm after reaching state $s_i$ at time $t$, so that $g(s_i, t)$ represents the probability of eventually 
pulling arm $i$ after reaching state $s_i$ at time $t$. The first inequality uses $g \le 1$, the second inequality follows from the assumption on the reward 
structure in the statement of the Lemma, and the third holds because rewards are non-negative. We thus see that coins satisfy the decreasing returns property. □ 

Lemma 5. For bandits satisfying the decreasing returns property (Property 1), 

$$E\Big[ \sum_{i} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} \Big] \ge \frac{1}{2} E[R_{1/2}].$$

Proof. We note that assuming Property 1 implies that $E[R_i^{T/2}] \ge \frac{1}{2} E[R_i]$ for all $i$. The assertion of the 
Lemma is then evident. In particular, 

$$\begin{aligned}
E[R_{1/2}] &= \sum_{i} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big) E[R_i] \\
&\le \sum_{i} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big)\, 2 E[R_i^{T/2}] \\
&= 2 E\Big[ \sum_{i} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} \Big],
\end{aligned}$$

where the first and second equality use the fact that $R_i$ and $R_i^{T/2}$ are each independent of $T_j$ for $j \ne i$. □ 
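As a sanity check on the hypothesis of Lemma 2, the reward condition $r(s_i, p) \ge \sum_{s'_i} P(s_i, p, s'_i)\, r(s'_i, p)$ can be verified directly for one concrete model of a coin. The sketch below assumes a Beta-Bernoulli parametrization, in which the state is a pair of counts $(a, b)$ and the expected reward of a pull is the posterior mean $a/(a+b)$; this parametrization and the function names are assumptions made for illustration and are not taken from the text above. Under this assumption the posterior mean is a martingale, so the condition holds with equality.

```python
from fractions import Fraction


def reward(a, b):
    """Expected payoff of one pull in state (a, b): the mean of a Beta(a, b) belief."""
    return Fraction(a, a + b)


def expected_next_reward(a, b):
    """Expected payoff of the following pull, averaged over this pull's outcome."""
    p_success = reward(a, b)
    return p_success * reward(a + 1, b) + (1 - p_success) * reward(a, b + 1)


def hypothesis_holds(max_count):
    """Check r(s, p) >= sum_s' P(s, p, s') r(s', p) on a grid of Beta states."""
    return all(
        reward(a, b) >= expected_next_reward(a, b)
        for a in range(1, max_count + 1)
        for b in range(1, max_count + 1)
    )


print(hypothesis_holds(20))
print(reward(2, 3), expected_next_reward(2, 3))
```

The check only covers the one-step reward condition; the decreasing returns conclusion itself depends on the policy and is what the proof of Lemma 2 establishes.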
C Proof of Theorem 2 

The following lemma shows that the optimal objective function value of the dual is equal to $OPT(RLP(\pi_0))$. 
In particular, it shows that Slater's constraint qualification condition holds (see, for example, Boyd and 
Vandenberghe (2004)). 

Lemma 7. $OPT(RLP(\pi_0)) = OPT(DRLP(\pi_0))$. That is, strong duality holds. 

Proof. To show this, it is sufficient to show that there is a strictly feasible solution to (6.1), i.e., one for which the inequality 
is satisfied strictly. This is straightforward: for each bandit $i$, set $\pi_i(s_i, \phi_i, t) = \pi_{0,i}(s_i)$ for all 
$s_i$ and $t$, where $\pi_{0,i}(s_i)$ is the probability of bandit $i$ starting in state $s_i$. Set $\pi_i(s_i, a_i, t) = 0$ for $a_i \ne \phi_i$, for 
all $s_i, t$. These state-action frequencies belong to $D_i(\pi_0)$, and also give $T_i(\pi_i) = 0$. □ 

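We denote $R^* = OPT(RLP(\pi_0)) = OPT(DRLP(\pi_0))$. Also, define the following set of total running 
times for all bandits corresponding to a dual variable $\lambda$: 

$$\mathcal{T}(\lambda) = \Big\{ \sum_i T_i(\pi_i) \;:\; \pi_i \in \arg\max_{\pi_i} \big( R_i(\pi_i) - \lambda T_i(\pi_i) \big),\ \forall i \Big\}.$$

Lemma 8. If $0 \le \lambda_1 < \lambda_2$, then 

$$\min \mathcal{T}(\lambda_1) \ge \max \mathcal{T}(\lambda_2).$$

Proof. We denote the objective function in $DRLP(\pi_0)$, i.e., the dual function, by 

$$g(\lambda) = \lambda kT + \sum_i \max_{\pi_i} \big( R_i(\pi_i) - \lambda T_i(\pi_i) \big).$$

The slack in the total running time constraint $\sum_i T_i(\pi_i) \le kT$, i.e., $kT - \sum_i T_i(\pi_i)$, is a subgradient of $g$ for 
any $\pi$ such that $\pi_i \in \arg\max \big( R_i(\pi_i) - \lambda T_i(\pi_i) \big)$ (see Shor (1985)). Thus, the set of subgradients of the dual 
function $g$ at $\lambda$ is given by 

$$\partial g(\lambda) = \{ kT - t : t \in \mathcal{T}(\lambda) \}.$$

Then, since $g$ is a convex function, it follows that for $0 \le \lambda_1 < \lambda_2$, 

$$kT - t_1 \le kT - t_2, \quad \forall\, t_1 \in \mathcal{T}(\lambda_1),\ t_2 \in \mathcal{T}(\lambda_2).$$

The lemma then follows. □ 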
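The last step of the proof of Lemma 8 relies on the monotonicity of subgradients of a convex function. For completeness, a short derivation is sketched here; the symbols $d_1$ and $d_2$ are introduced only for this illustration. Let $d_1 = kT - t_1$ be a subgradient of $g$ at $\lambda_1$ and $d_2 = kT - t_2$ a subgradient of $g$ at $\lambda_2$, with $\lambda_1 < \lambda_2$. The subgradient inequality applied at each point gives

$$g(\lambda_2) \ge g(\lambda_1) + d_1 (\lambda_2 - \lambda_1), \qquad g(\lambda_1) \ge g(\lambda_2) + d_2 (\lambda_1 - \lambda_2).$$

Adding the two inequalities and cancelling $g(\lambda_1) + g(\lambda_2)$ yields

$$0 \ge (d_1 - d_2)(\lambda_2 - \lambda_1),$$

and since $\lambda_2 - \lambda_1 > 0$ this gives $d_1 \le d_2$, that is, $t_1 \ge t_2$.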



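$(\pi^*, \lambda^*)$ is an optimal solution for the primal and dual problems if and only if (see, for example, Boyd 
and Vandenberghe (2004)) 

$$\begin{aligned}
& \pi_i^* \in \arg\max_{\pi_i} \big( R_i(\pi_i) - \lambda^* T_i(\pi_i) \big), \quad \forall i, \\
& \text{either } \lambda^* > 0 \text{ and } \sum_i T_i(\pi_i^*) = kT, \text{ or } \lambda^* = 0 \text{ and } \sum_i T_i(\pi_i^*) \le kT.
\end{aligned} \tag{C.1}$$

We prove the correctness of the RLP Solver algorithm separately for the cases when $\lambda^* = 0$ is optimal, 
and when any optimal solution satisfies $\lambda^* > 0$. We denote the values of the bounds on the dual variable 
that are computed by the last iteration of the RLP solver algorithm by $\lambda^{\text{feas}}$ and $\lambda^{\text{infeas}}$. Recall that 

$$\pi_i^{\text{feas}} \in \arg\max_{\pi_i} \big( R_i(\pi_i) - \lambda^{\text{feas}} T_i(\pi_i) \big), \qquad
\pi_i^{\text{infeas}} \in \arg\max_{\pi_i} \big( R_i(\pi_i) - \lambda^{\text{infeas}} T_i(\pi_i) \big).$$

We introduce some additional notation: 

$$T^{\text{feas}} = \sum_i T_i(\pi_i^{\text{feas}}), \quad R^{\text{feas}} = \sum_i R_i(\pi_i^{\text{feas}}), \quad
T^{\text{infeas}} = \sum_i T_i(\pi_i^{\text{infeas}}), \quad R^{\text{infeas}} = \sum_i R_i(\pi_i^{\text{infeas}}).$$

Thus, 

$$\begin{aligned}
g(\lambda^{\text{feas}}) &= \lambda^{\text{feas}} kT + R^{\text{feas}} - \lambda^{\text{feas}} T^{\text{feas}}, \\
g(\lambda^{\text{infeas}}) &= \lambda^{\text{infeas}} kT + R^{\text{infeas}} - \lambda^{\text{infeas}} T^{\text{infeas}}.
\end{aligned} \tag{C.2}$$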



Lemma 9. If $(\pi^*, \lambda^*)$ is a solution to (C.1) with $\lambda^* = 0$, then 

$$R^* - \big( \alpha R^{\text{infeas}} + (1 - \alpha) R^{\text{feas}} \big) \le \epsilon.$$

Proof. If $\lambda^* = 0$, it follows from (C.1) that there is some $t \in \mathcal{T}(0)$ such that $t \le kT$. Hence, it follows 
from Lemma 8 that for any $\lambda > 0$, $\max \mathcal{T}(\lambda) \le kT$. Hence, Line 11 of the RLP solver algorithm is always 
invoked, and so the RLP solver algorithm converges to 

$$\lambda^{\text{infeas}} = 0 \quad \text{and} \quad 0 < \lambda^{\text{feas}} < \epsilon/(kT).$$

Hence, $\pi_i^{\text{infeas}} \in \arg\max \big( R_i(\pi_i) \big)$. Also, $g(\lambda)$ is minimized at $\lambda^* = 0$. Hence, it follows from Lemma 7 that 

$$R^* = g(0) = \sum_i \max_{\pi_i} R_i(\pi_i) = R^{\text{infeas}}. \tag{C.3}$$

Since $\lambda^{\text{feas}} > 0$, it follows that $T^{\text{feas}} \le kT$. Hence, we now consider the following three cases: 

• Case 1: $T^{\text{infeas}} \le kT$. 

Here, $\alpha = 1$, and hence, using (C.3), it follows that 

$$R^* - \big( \alpha R^{\text{infeas}} + (1 - \alpha) R^{\text{feas}} \big) = 0.$$

• Case 2: $T^{\text{feas}} = kT$. 

In this case, $(\pi^{\text{feas}}, \lambda^{\text{feas}})$ satisfy the optimality conditions in (C.1). Thus, $R^{\text{feas}} = R^*$, and so (since 
$R^{\text{infeas}} = R^*$ by (C.3)) 

$$R^* - \big( \alpha R^{\text{infeas}} + (1 - \alpha) R^{\text{feas}} \big) = 0.$$

• Case 3: $T^{\text{infeas}} > kT > T^{\text{feas}}$. 

Since $g(\lambda)$ is minimized at $\lambda = 0$, 

$$R^* = g(0) \le g(\lambda^{\text{feas}}) = \lambda^{\text{feas}} kT + R^{\text{feas}} - \lambda^{\text{feas}} T^{\text{feas}}.$$

Since $R^* = R^{\text{infeas}}$ (from (C.3)), and using the fact that $0 < \alpha < 1$ when $T^{\text{infeas}} > kT > T^{\text{feas}}$, we 
have 

$$\begin{aligned}
R^* - \alpha R^{\text{infeas}} - (1 - \alpha) R^{\text{feas}} &= (1 - \alpha)(R^* - R^{\text{feas}}) \\
&\le (1 - \alpha)(kT - T^{\text{feas}})\, \lambda^{\text{feas}} \\
&< \epsilon.
\end{aligned} \tag{C.4}$$

□ 



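Lemma 10. If every solution to (C.1) satisfies $\lambda^* > 0$, then 

$$R^* - \big( \alpha R^{\text{infeas}} + (1 - \alpha) R^{\text{feas}} \big) \le 2\epsilon.$$

Proof. The RLP solver algorithm is initialized with $\lambda^{\text{infeas}} = 0$. Since $\lambda^* > 0$ and $kT \in \mathcal{T}(\lambda^*)$ (by (C.1)), 
it follows from Lemma 8 that $\min \mathcal{T}(0) \ge kT$. But $kT \notin \mathcal{T}(0)$, else there would be a solution to (C.1) 
that satisfies $\lambda^* = 0$, leading to a contradiction. Thus, $\min \mathcal{T}(0) > kT$, and so lines 8-12 of the RLP solver 
algorithm guarantee that 

$$T^{\text{infeas}} > kT. \tag{C.5}$$

Using an appropriate modification of the optimality conditions in (C.1) for the case where the horizon is 
$T^{\text{infeas}}$ (instead of $kT$), we see that $R^{\text{infeas}}$ is the maximum reward earned by any policy in $\{\pi : \sum_i T_i(\pi_i) \le T^{\text{infeas}}\}$. 
Since $R^*$ is the maximum reward earned by any policy in $\{\pi : \sum_i T_i(\pi_i) \le kT < T^{\text{infeas}}\}$, 

$$R^{\text{infeas}} \ge R^*. \tag{C.6}$$

We now argue that $T^{\text{feas}} \le kT$. The RLP solver algorithm is initialized with $\lambda^{\text{feas}}$ large enough (larger than any single-pull reward $r$). 
Since $\pi_i^{\text{feas}} \in \arg\max \big( R_i(\pi_i) - \lambda^{\text{feas}} T_i(\pi_i) \big)$, initially, the optimal policy is to idle at all times. Thus, $T^{\text{feas}} \le kT$ 
at initialization; at all other iterations, lines 8-12 of the algorithm ensure that $T^{\text{feas}} \le kT$. 
We now consider the following two cases separately: 

• Case 1: $T^{\text{feas}} = kT$. 

In this case, $(\pi^{\text{feas}}, \lambda^{\text{feas}})$ satisfy the optimality conditions in (C.1), and so $R^{\text{feas}} = R^*$. Now, using (C.6), 

$$\alpha R^{\text{infeas}} + (1 - \alpha) R^{\text{feas}} \ge R^*.$$

• Case 2: $T^{\text{feas}} < kT$. 

Note that the RLP solver algorithm terminates when 

$$\lambda^{\text{feas}} - \lambda^{\text{infeas}} < \epsilon/(kT). \tag{C.7}$$

Now $T^{\text{feas}} < kT$ and $kT \in \mathcal{T}(\lambda^*)$. If $\lambda^{\text{feas}} < \lambda^*$, it follows from Lemma 8 that 

$$T^{\text{feas}} \ge \min \mathcal{T}(\lambda^{\text{feas}}) \ge \max \mathcal{T}(\lambda^*) \ge kT,$$

which is a contradiction. Hence, 

$$\lambda^{\text{feas}} \ge \lambda^*. \tag{C.8}$$

Also, since $kT \in \mathcal{T}(\lambda^*)$, it follows from Lemma 8 that for any $\lambda > \lambda^*$, $\max \mathcal{T}(\lambda) \le kT$. So, (C.5) 
implies that 

$$\lambda^{\text{infeas}} \le \lambda^*. \tag{C.9}$$

It follows from (C.7), (C.8), (C.9) that 

$$\max\Big\{ 0,\ \lambda^* - \frac{\epsilon}{kT} \Big\} \le \lambda^{\text{infeas}} \quad \text{and} \quad \lambda^{\text{feas}} \le \lambda^* + \frac{\epsilon}{kT}.$$

Since $g(\lambda)$ is minimized at $\lambda^*$, it follows from (C.2) and strong duality proved in Lemma 7 that 

$$\begin{aligned}
g(\lambda^*) = R^* &\le g(\lambda^{\text{feas}}) = R^{\text{feas}} + \lambda^{\text{feas}} (kT - T^{\text{feas}}) \le R^{\text{feas}} + (\lambda^* + \delta)(kT - T^{\text{feas}}), \\
g(\lambda^*) = R^* &\le g(\lambda^{\text{infeas}}) = R^{\text{infeas}} + \lambda^{\text{infeas}} (kT - T^{\text{infeas}}) \le R^{\text{infeas}} + (\lambda^* - \delta)(kT - T^{\text{infeas}}),
\end{aligned}$$

where $\delta = \epsilon/(kT)$. Note that the above inequalities also use $T^{\text{feas}} < kT$ (by assumption) and $T^{\text{infeas}} > kT$ 
(from (C.5)). Thus 

$$\begin{aligned}
R^* - \alpha R^{\text{infeas}} - (1 - \alpha) R^{\text{feas}} &= \alpha (R^* - R^{\text{infeas}}) + (1 - \alpha)(R^* - R^{\text{feas}}) \\
&\le \alpha (\delta - \lambda^*)(T^{\text{infeas}} - kT) + (1 - \alpha)(\lambda^* + \delta)(kT - T^{\text{feas}}) \\
&= 2\delta\, \frac{(T^{\text{infeas}} - kT)(kT - T^{\text{feas}})}{T^{\text{infeas}} - T^{\text{feas}}} \\
&\le 2\delta kT \\
&= 2\epsilon.
\end{aligned}$$

□ 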



Theorem 2. RLP Solver produces a feasible solution to $RLP(\pi_0)$ of value at least $OPT(RLP(\pi_0)) - 2\epsilon$. 
Proof. The result follows from Lemmas 9 and 10 and the fact that $\lambda^* \ge 0$ (from (C.1)). □ 
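The bisection argument behind Lemmas 9 and 10 can be illustrated on a toy problem. The following is a minimal sketch under simplifying assumptions, not the RLP Solver itself: each bandit is abstracted as a hypothetical finite menu of (expected reward, expected running time) pairs that includes the always-idle option (0, 0), the per-bandit maximization is a brute-force search over that menu, and the mixing weight is taken to be $\alpha = (kT - T^{\text{feas}})/(T^{\text{infeas}} - T^{\text{feas}})$, which is the value consistent with the final computation in the proof of Lemma 10. All function names and the numerical instance are invented for the example.

```python
def best_response(options, lam):
    """Pick the (reward, time) pair maximizing reward - lam * time; ties go to less time."""
    return max(options, key=lambda rt: (rt[0] - lam * rt[1], -rt[1]))


def totals(bandits, lam):
    """Total reward and total running time when every bandit best-responds to price lam."""
    picks = [best_response(options, lam) for options in bandits]
    return sum(r for r, _ in picks), sum(t for _, t in picks)


def solve_by_bisection(bandits, budget, eps):
    """Bisection on the dual price; returns (mixed reward, mixed running time)."""
    lam_infeas = 0.0
    r_inf, t_inf = totals(bandits, lam_infeas)
    if t_inf <= budget:  # the unconstrained optimum already fits: alpha = 1
        return r_inf, t_inf
    # A price above every reward-per-unit-time ratio makes idling optimal for all bandits.
    lam_feas = 1.0 + max(r / t for options in bandits for r, t in options if t > 0)
    r_feas, t_feas = totals(bandits, lam_feas)
    while lam_feas - lam_infeas >= eps / budget:  # stopping rule mirrors (C.7)
        mid = 0.5 * (lam_feas + lam_infeas)
        r_mid, t_mid = totals(bandits, mid)
        if t_mid > budget:
            lam_infeas, r_inf, t_inf = mid, r_mid, t_mid
        else:
            lam_feas, r_feas, t_feas = mid, r_mid, t_mid
    # Mix the two end points so that the total running time meets the budget exactly.
    alpha = (budget - t_feas) / (t_inf - t_feas)
    return alpha * r_inf + (1 - alpha) * r_feas, alpha * t_inf + (1 - alpha) * t_feas


# Hypothetical instance: two bandits, budget kT = 2.5 units of running time.
toy_bandits = [
    [(0.0, 0.0), (3.0, 1.0), (4.0, 2.0)],
    [(0.0, 0.0), (2.0, 1.0)],
]
value, used = solve_by_bisection(toy_bandits, 2.5, 1e-6)
print(value, used)
```

On this instance the marginal rewards per unit of time are 3, 2 and 1, so a budget of 2.5 is best spent on the first two units plus half of the third; the mixture returned by the sketch reproduces that value, in line with the $2\epsilon$ guarantee of Theorem 2.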